Preface


This  book  is  intended  for  a  first  year  graduate  course  in  econometrics.  I  tried  to  strike  a  balance 
between  a  rigorous  approach  that  proves  theorems,  and  a  completely  empirical  approach  where 
no  theorems  are  proved.  Some  of  the  strengths  of  this  book  lie  in  presenting  some  difficult 
material  in  a  simple,  yet  rigorous  manner.  For  example,  Chapter  12  on  pooling  time-series  of 
cross-section  data  is  drawn  from  my  area  of  expertise  in  econometrics  and  the  intent  here  is  to 
make  this  material  more  accessible  to  the  general  readership  of  econometrics. 

This book teaches some of the basic econometric methods and the underlying assumptions
behind them. Estimation, hypothesis testing and prediction are three recurrent themes in
this book. Some uses of econometric methods include (i) empirical testing of economic theory,
whether it is the permanent income consumption theory or purchasing power parity; (ii)
forecasting, whether it is GNP or unemployment in the U.S. economy or future sales in the
computer industry; and (iii) estimation of price elasticities of demand, or returns to scale in production.
More importantly, econometric methods can be used to simulate the effect of policy changes
like a tax increase on gasoline consumption, or a ban on advertising on cigarette consumption.

It is left to the reader to choose among the available econometric/statistical software,
such as EViews, SAS, Stata, TSP, SHAZAM, Microfit, PcGive, LIMDEP, and RATS, to mention
a few. The empirical illustrations in the book use a variety of these software packages, but
mostly Stata and EViews. Of course, these packages have different advantages and disadvantages.
However, for the basic coverage in this book, these differences may be minor and more
a matter of what software the reader is familiar or comfortable with. In most cases, I encourage
my students to use more than one of these packages and to verify their results using simple
programming languages like GAUSS, OX, R and MATLAB.

This book is not meant to be encyclopedic. I did not attempt the coverage of Bayesian
econometrics simply because it is not my comparative advantage. The reader should consult
Koop (2003) for a more recent treatment of the subject. Nonparametrics and semiparametrics
are popular methods in today’s econometrics, yet they are not covered in this book in order to keep
the technical difficulty at a low level. These are a must for a follow-up course in econometrics;
see Li and Racine (2007). Also, for a more rigorous treatment of asymptotic theory, see White
(1984). Despite these limitations, the topics covered in this book are basic and necessary in the
training of every economist. In fact, it is but a ‘stepping stone’, a ‘sample of the good stuff’ the
reader will find in this young, energetic and ever-evolving field.

I hope you will share my enthusiasm and optimism about the importance of the tools you will
learn  when  you  are  through  reading  this  book.  Hopefully,  it  will  encourage  you  to  consult  the 
suggested  readings  on  this  subject  that  are  referenced  at  the  end  of  each  chapter.  In  his  inaugural 
lecture  at  the  University  of  Birmingham,  entitled  “Econometrics:  A  View  from  the  Toolroom,” 
Peter  C.B.  Phillips  (1977)  concluded: 

“the  toolroom  may  lack  the  glamour  of  economics  as  a  practical  art  in  government 
or  business,  but  it  is  every  bit  as  important.  For  the  tools  (econometricians)  fashion 
provide  the  key  to  improvements  in  our  quantitative  information  concerning  matters 
of economic policy.”

As a student of econometrics, I have benefited from reading Johnston (1984), Kmenta (1986),
Theil  (1971),  Klein  (1974),  Maddala  (1977),  and  Judge,  et  al.  (1985),  to  mention  a  few.  As  a 
teacher  of  undergraduate  econometrics,  I  have  learned  from  Kelejian  and  Oates  (1989),  Wallace 
and  Silver  (1988),  Maddala  (1992),  Kennedy  (1992),  Wooldridge  (2003)  and  Stock  and  Watson 
(2003).  As  a  teacher  of  graduate  econometrics  courses,  Greene  (1993),  Judge,  et  al.  (1985), 
Fomby,  Hill  and  Johnson  (1984)  and  Davidson  and  MacKinnon  (1993)  have  been  my  regular 
companions. The influence of these books will be evident in the pages that follow. Courses
requiring matrix algebra as a prerequisite to econometrics can start with Chapter 7. Chapter 2
provides a quick refresher on the statistical background needed for a proper
understanding of the material in this book.

For  an  advanced  undergraduate/masters  class  not  requiring  matrix  algebra,  one  can  structure 
a  course  based  on  Chapter  1;  Section  2.6  on  descriptive  statistics;  Chapters  3-6;  Section  11.1 
on  simultaneous  equations;  and  Chapter  14  on  time-series  analysis. 

The  exercises  contain  theoretical  problems  that  should  supplement  the  understanding  of  the 
material  in  each  chapter.  Some  of  these  exercises  are  drawn  from  the  Problems  and  Solutions 
series  of  Econometric  Theory  (reprinted  with  permission  of  Cambridge  University  Press).  In 
addition,  the  book  has  a  set  of  empirical  illustrations  demonstrating  some  of  the  basic  results 
learned in each chapter. Data sets from published articles are provided for the empirical exercises.
These exercises are solved using several econometric software packages, and the solutions are available
in the Solution Manual. This book is by no means an applied econometrics text, and the reader
should consult Berndt’s (1991) textbook for an excellent treatment of this subject. Instructors
and students are encouraged to get other data sets from the internet or journals that provide
backup data sets to published articles. The Journal of Applied Econometrics and the Journal
of Business and Economic Statistics are two such journals. In fact, the Journal of Applied
Econometrics has a replication section for which I am serving as an editor. In my econometrics
course, I require my students to replicate an empirical paper. Many students find this experience
rewarding because it gives them hands-on application of econometric methods that prepares
them for doing their own empirical work.

I would like to thank my teachers Lawrence R. Klein, Roberto S. Mariano and Robert Shiller,
who introduced me to this field; James M. Griffin, who provided some data sets, empirical
exercises and helpful comments; and many colleagues who had direct and indirect influence
on the contents of this book, including G.S. Maddala, Jan Kmenta, Peter Schmidt, Cheng
Hsiao, Tom Wansbeek, Walter Kramer, Maxwell King, Peter C. B. Phillips, Alberto Holly, Essie
Maasoumi, Aris Spanos, Farshid Vahid, Heather Anderson, Arnold Zellner and Bryan Brown.
Also, I would like to thank my students Wei-Wen Xiong, Ming-Jang Weng, Kiseok Nam, Dong
Li, Gustavo Sanchez, Long Liu and Liu Tian, who read parts of this book and solved several of
the exercises. I also thank Martina Bihn at Springer for her continuous support and professional
editorial help. I have also benefited from my visits to the University of Arizona, the University of
California, San Diego, Monash University, the University of Zurich, the Institute of Advanced Studies in
Vienna, and the University of Dortmund, Germany. A special thanks to my wife Phyllis, whose
help and support were essential to completing this book.


References 

Berndt, E.R. (1991), The Practice of Econometrics: Classic and Contemporary (Addison-Wesley: Reading, MA).

Davidson, R. and J.G. MacKinnon (1993), Estimation and Inference in Econometrics (Oxford University
Press: Oxford, MA).

Fomby, T.B., R.C. Hill and S.R. Johnson (1984), Advanced Econometric Methods (Springer-Verlag: New York).

Greene, W.H. (1993), Econometric Analysis (Macmillan: New York).

Johnston, J. (1984), Econometric Methods, 3rd Ed., (McGraw-Hill: New York).

Judge, G.G., W.E. Griffiths, R.C. Hill, H. Lütkepohl and T.C. Lee (1985), The Theory and Practice of
Econometrics, 2nd Ed., (John Wiley: New York).

Kelejian, H. and W. Oates (1989), Introduction to Econometrics: Principles and Applications, 2nd Ed.,
(Harper and Row: New York).

Kennedy, P. (1992), A Guide to Econometrics (The MIT Press: Cambridge, MA).

Klein, L.R. (1974), A Textbook of Econometrics (Prentice-Hall: New Jersey).

Kmenta, J. (1986), Elements of Econometrics, 2nd Ed., (Macmillan: New York).

Koop, G. (2003), Bayesian Econometrics (Wiley: New York).

Li, Q. and J.S. Racine (2007), Nonparametric Econometrics (Princeton University Press: New Jersey).

Maddala, G.S. (1977), Econometrics (McGraw-Hill: New York).

Maddala, G.S. (1992), Introduction to Econometrics (Macmillan: New York).

Phillips, P.C.B. (1977), “Econometrics: A View From the Toolroom,” Inaugural Lecture, University of
Birmingham, Birmingham, England.

Stock, J.H. and M.W. Watson (2003), Introduction to Econometrics (Addison-Wesley: New York).

Theil, H. (1971), Principles of Econometrics (John Wiley: New York).

Wallace, T.D. and L. Silver (1988), Econometrics: An Introduction (Addison-Wesley: New York).

White, H. (1984), Asymptotic Theory for Econometrics (Academic Press: Florida).

Wooldridge, J.M. (2003), Introductory Econometrics (South-Western: Ohio).


Data 

The  data  sets  used  in  this  text  can  be  downloaded  from  the  Springer  website  in  Germany. 
The  address  is:  http://www.springer.com/978-3-642-20058-8.  Please  select  the  link  “Samples  & 
Supplements”  from  the  right-hand  column. 


Table  of  Contents 


Preface  VII 

Part  I  1 

1  What  Is  Econometrics?  3 

1.1  Introduction .  3 

1.2  A  Brief  History  .  5 

1.3  Critiques  of  Econometrics .  7 

1.4  Looking  Ahead .  8 

Notes  .  10 

References .  10 

2  Basic  Statistical  Concepts  13 

2.1  Introduction .  13 

2.2  Methods  of  Estimation .  13 

2.3  Properties  of  Estimators  .  16 

2.4  Hypothesis  Testing  .  21 

2.5  Confidence  Intervals .  30 

2.6  Descriptive  Statistics .  31 

Notes  .  36 

Problems  .  36 

References . 42 

Appendix . 42 

3  Simple  Linear  Regression  49 

3.1  Introduction . 49 

3.2  Least  Squares  Estimation  and  the  Classical  Assumptions .  50 

3.3  Statistical  Properties  of  Least  Squares .  55 

3.4  Estimation  of  σ² .  56

3.5  Maximum  Likelihood  Estimation .  57 

3.6  A  Measure  of  Fit  .  58 

3.7  Prediction .  60 

3.8  Residual  Analysis .  60 

3.9  Numerical  Example .  63 

3.10  Empirical  Example .  64 

Problems  .  67 

References .  71 

Appendix .  72 

4  Multiple  Regression  Analysis  73 

4.1  Introduction .  73 


4.2  Least  Squares  Estimation .  73 

4.3  Residual  Interpretation  of  Multiple  Regression  Estimates .  75 

4.4  Overspecification  and  Underspecification  of  the  Regression  Equation .  76 

4.5  R-Squared  Versus  R-Bar-Squared .  78 

4.6  Testing  Linear  Restrictions .  78 

4.7  Dummy  Variables .  81 

Note .  85 

Problems  .  85 

References .  91 

Appendix .  92 

5  Violations  of  the  Classical  Assumptions  95 

5.1  Introduction .  95 

5.2  The  Zero  Mean  Assumption .  95 

5.3  Stochastic  Explanatory  Variables  .  96 

5.4  Normality  of  the  Disturbances .  98 

5.5  Heteroskedasticity .  98 

5.6  Autocorrelation . 110 

Notes  . 120 

Problems  . 120 

References . 126 

6  Distributed  Lags  and  Dynamic  Models  131 

6.1  Introduction . 131 

6.2  Infinite  Distributed  Lag . 137 

6.2.1  Adaptive  Expectations  Model  (AEM) . 138 

6.2.2  Partial  Adjustment  Model  (PAM) . 138 

6.3  Estimation  and  Testing  of  Dynamic  Models  with  Serial  Correlation . 139 

6.3.1  A  Lagged  Dependent  Variable  Model  with  AR(1)  Disturbances . 140 

6.3.2  A  Lagged  Dependent  Variable  Model  with  MA(1)  Disturbances . 142 

6.4  Autoregressive  Distributed  Lag . 143 

Note . 144 

Problems  . 144 

References . 146 

Part  II  149 

7  The  General  Linear  Model:  The  Basics  151 

7.1  Introduction . 151 

7.2  Least  Squares  Estimation . 151 

7.3  Partitioned  Regression  and  the  Frisch-Waugh-Lovell  Theorem . 154

7.4  Maximum  Likelihood  Estimation . 156 

7.5  Prediction . 159 

7.6  Confidence  Intervals  and  Test  of  Hypotheses . 160 

7.7  Joint  Confidence  Intervals  and  Test  of  Hypotheses . 160 


7.8  Restricted  MLE  and  Restricted  Least  Squares . 161 

7.9  Likelihood  Ratio,  Wald  and  Lagrange  Multiplier  Tests . 162 

Notes  . 167 

Problems  . 167 

References . 172 

Appendix . 173 

8  Regression  Diagnostics  and  Specification  Tests  179 

8.1  Influential  Observations . 179 

8.2  Recursive  Residuals . 187 

8.3  Specification  Tests . 196 

8.4  Nonlinear  Least  Squares  and  the  Gauss-Newton  Regression . 206 

8.5  Testing  Linear  Versus  Log-Linear  Functional  Form . 213 

Notes  . 215 

Problems  . 215 

References . 219 

9  Generalized  Least  Squares  223 

9.1  Introduction . 223 

9.2  Generalized  Least  Squares  . 223 

9.3  Special  Forms  of  Ω  . 225

9.4  Maximum  Likelihood  Estimation . 226 

9.5  Test  of  Hypotheses  . 226 

9.6  Prediction . 227 

9.7  Unknown  Ω  . 227

9.8  The  W,  LR  and  LM  Statistics  Revisited  . 228 

9.9  Spatial  Error  Correlation . 230 

Note . 231 

Problems  . 232 

References . 237 

10  Seemingly  Unrelated  Regressions  241 

10.1  Introduction . 241 

10.2  Feasible  GLS  Estimation . 243 

10.3  Testing  Diagonality  of  the  Variance-Covariance  Matrix . 246 

10.4  Seemingly  Unrelated  Regressions  with  Unequal  Observations . 246 

10.5  Empirical  Examples . 248 

Problems  . 251 

References . 254 

11  Simultaneous  Equations  Model  257 

11.1  Introduction . 257 

11.1.1  Simultaneous  Bias . 257 

11.1.2  The  Identification  Problem . 260 

11.2  Single  Equation  Estimation:  Two-Stage  Least  Squares . 263 

11.2.1  Spatial  Lag  Dependence . 271 


11.3  System  Estimation:  Three-Stage  Least  Squares . 272 

11.4  Test  for  Over-Identification  Restrictions . 273 

11.5  Hausman’s  Specification  Test . 275 

11.6  Empirical  Examples . 278 

Notes  . 285 

Problems  . 286 

References . 296 

Appendix . 298 

12  Pooling  Time-Series  of  Cross-Section  Data  305 

12.1  Introduction . 305 

12.2  The  Error  Components  Model . 305 

12.2.1  The  Fixed  Effects  Model . 306 

12.2.2  The  Random  Effects  Model  . 308 

12.2.3  Maximum  Likelihood  Estimation . 312 

12.3  Prediction . 313 

12.4  Empirical  Example . 313 

12.5  Testing  in  a  Pooled  Model . 317 

12.6  Dynamic  Panel  Data  Models . 321 

12.6.1  Empirical  Illustration . 324 

12.7  Program  Evaluation  and  Difference-in-Differences  Estimator . 326

12.7.1  The  Difference-in-Differences  Estimator . 327 

Problems  . 327 

References . 330 

13  Limited  Dependent  Variables  333 

13.1  Introduction . 333 

13.2  The  Linear  Probability  Model . 333 

13.3  Functional  Form:  Logit  and  Probit . 334 

13.4  Grouped  Data . 336 

13.5  Individual  Data:  Probit  and  Logit . 341 

13.6  The  Binary  Response  Model  Regression . 342 

13.7  Asymptotic  Variances  for  Predictions  and  Marginal  Effects . 344 

13.8  Goodness  of  Fit  Measures . 344 

13.9  Empirical  Examples . 345 

13.10  Multinomial  Choice  Models . 350 

13.10.1  Ordered  Response  Models . 350 

13.10.2  Unordered  Response  Models . 354 

13.11  The  Censored  Regression  Model . 356 

13.12  The  Truncated  Regression  Model  . 359 

13.13  Sample  Selectivity . 360 

Notes  . 362 

Problems  . 362 

References . 368 

Appendix . 370 


14  Time-Series  Analysis  373 

14.1  Introduction . 373 

14.2  Stationarity  . 373 

14.3  The  Box  and  Jenkins  Method  . 375 

14.4  Vector  Autoregression . 378 

14.5  Unit  Roots . 379 

14.6  Trend  Stationary  Versus  Difference  Stationary . 383 

14.7  Cointegration  . 384 

14.8  Autoregressive  Conditional  Heteroskedasticity . 387 

Note . 390 

Problems  . 390 

References . 394 

Appendix  397 

List  of  Figures  403 

List  of  Tables  405 

Index  407 


Part  I 


CHAPTER  1 

What  Is  Econometrics? 


1.1  Introduction 

What  is  econometrics?  A  few  definitions  are  given  below: 

The method of econometric research aims, essentially, at a conjunction of economic
theory and actual measurements, using the theory and technique of statistical inference
as a bridge pier.

Trygve  Haavelmo  (1944) 

Econometrics may be defined as the quantitative analysis of actual economic phenomena
based on the concurrent development of theory and observation, related by
appropriate methods of inference.

Samuelson,  Koopmans  and  Stone  (1954) 

Econometrics  is  concerned  with  the  systematic  study  of  economic  phenomena  using 
observed  data. 

Aris  Spanos  (1986) 

Broadly speaking, econometrics aims to give empirical content to economic relations
for testing economic theories, forecasting, decision making, and for ex post
decision/policy evaluation.

J.  Geweke,  J.  Horowitz,  and  M.H.  Pesaran  (2008) 

For  other  definitions  of  econometrics,  see  Tintner  (1953). 

An econometrician has to be a competent mathematician and statistician who is an economist
by training. Fundamental knowledge of mathematics, statistics and economic theory is a
necessary prerequisite for this field. As Ragnar Frisch (1933) explains in the first issue of
Econometrica, it is the unification of statistics, economic theory and mathematics that constitutes
econometrics. Each viewpoint, by itself, is necessary but not sufficient for a real understanding
of quantitative relations in modern economic life.

Ragnar Frisch is credited with coining the term ‘econometrics’ and he is one of the founders
of the Econometric Society, see Christ (1983). Econometrics aims at giving empirical content
to economic relationships. The three key ingredients are economic theory, economic data, and
statistical methods. Neither ‘theory without measurement’ nor ‘measurement without theory’
is sufficient for explaining economic phenomena. It is, as Frisch emphasized, their union that is
the key to success in the future development of econometrics.

Lawrence R. Klein, the 1980 recipient of the Nobel Prize in economics “for the creation of
econometric models and their application to the analysis of economic fluctuations and economic
policies,”¹ has always emphasized the integration of economic theory, statistical methods and
practical  economics.  The  exciting  thing  about  econometrics  is  its  concern  for  verifying  or  refuting 
economic  laws,  such  as  purchasing  power  parity,  the  life  cycle  hypothesis,  the  quantity  theory  of 
money,  etc.  These  economic  laws  or  hypotheses  are  testable  with  economic  data.  In  fact,  David 
F.  Hendry  (1980)  emphasized  this  function  of  econometrics: 

B.H.  Baltagi,  Econometrics,  Springer  Texts  in  Business  and  Economics,  DOI  10.1007/978-3-642-20059-5  1,  3 

The three golden rules of econometrics are test, test and test; that all three rules are
broken  regularly  in  empirical  applications  is  fortunately  easily  remedied.  Rigorously 
tested  models,  which  adequately  described  the  available  data,  encompassed  previous 
findings  and  were  derived  from  well  based  theories  would  enhance  any  claim  to  be 
scientific. 

Econometrics  also  provides  quantitative  estimates  of  price  and  income  elasticities  of  demand, 
returns  to  scale  in  production,  technical  efficiency  in  cost  functions,  wage  elasticities,  etc.  These 
are important for policy decision making. What is the effect on smoking of raising the tax on a
pack of cigarettes by 10%? How much will it generate in tax revenues? What is the
effect of raising the minimum wage by $1 per hour on unemployment? What is the effect of raising
the beer tax on motor vehicle fatalities?

Econometrics  also  provides  predictions  about  future  interest  rates,  unemployment,  or  GNP 
growth.  Lawrence  Klein  (1971)  emphasized  this  last  function  of  econometrics: 

Econometrics had its origin in the recognition of empirical regularities and the systematic
attempt to generalize these regularities into “laws” of economics. In a broad
sense, the use of such “laws” is to make predictions - about what might have or what
will come to pass. Econometrics should give a base for economic prediction beyond
experience if it is to be useful. In this broad sense it may be called the science of
economic prediction.

Econometrics,  while  based  on  scientific  principles,  still  retains  a  certain  element  of  art.  According 
to  Malinvaud  (1966),  the  art  in  econometrics  is  trying  to  find  the  right  set  of  assumptions 
which  are  sufficiently  specific,  yet  realistic  to  enable  us  to  take  the  best  possible  advantage  of 
the  available  data.  Data  in  economics  are  not  generated  under  ideal  experimental  conditions 
as in a physics laboratory. These data cannot be replicated and are most likely measured with
error.  In  some  cases,  the  available  data  are  proxies  for  variables  that  are  either  not  observed  or 
cannot  be  measured.  Many  published  empirical  studies  find  that  economic  data  may  not  have 
enough  variation  to  discriminate  between  two  competing  economic  theories.  Manski  (1995,  p.  8) 
argues  that 

Social  scientists  and  policymakers  alike  seem  driven  to  draw  sharp  conclusions,  even 
when  these  can  be  generated  only  by  imposing  much  stronger  assumptions  than  can 
be  defended.  We  need  to  develop  a  greater  tolerance  for  ambiguity.  We  must  face  up 
to  the  fact  that  we  cannot  answer  all  of  the  questions  that  we  ask. 

The “art” element in econometrics has left a number of distinguished economists doubtful
of the power of econometrics to yield sharp predictions. In his presidential address to the
American Economic Association, Wassily Leontief (1971, pp. 2-3) characterized econometric
work as:

an  attempt  to  compensate  for  the  glaring  weakness  of  the  data  base  available  to  us 
by  the  widest  possible  use  of  more  and  more  sophisticated  techniques.  Alongside  the 
mounting  pile  of  elaborate  theoretical  models  we  see  a  fast  growing  stock  of  equally 
intricate  statistical  tools.  These  are  intended  to  stretch  to  the  limit  the  meager  supply 
of  facts. 

Economic data can be of the cross-section type, e.g., a sample of firms or households or
countries at a particular point in time. An important data source is the Current Population
Survey. This is a monthly survey of 50,000 households in the U.S. which is used to estimate the
unemployment rate. Data can also be of the time-series type, e.g., macroeconomic variables
like Gross Domestic Product (GDP), Personal Disposable Income, Consumption, Government
Expenditures, etc. for the U.S. observed over the last 40 years. These can be found in the
Economic Report of the President. See Chapter 14 for some basic time-series methods in
econometrics. Data can also follow a group of households, firms, or countries over time, i.e.,
longitudinal data or panel data. The National Longitudinal Survey of Youth, 1979 consists of a
nationally representative sample of 12,686 young men and women who were 14-22 years old in
1979. These individuals were interviewed annually through 1994 and are currently interviewed on
a biennial basis. The list of variables includes information on schooling and career transitions,
marriage and fertility, training investments, child care usage and drug and alcohol use. See
Chapter 12 for some basic panel data methods in econometrics.

Most of the time the data collected are not ideal for the economic question at hand because
they were collected to satisfy legal requirements or to comply with regulatory agencies. Griliches
(1986, p. 1466) describes the situation as follows:

Econometricians have an ambivalent attitude towards economic data. At one level,
the ‘data’ are the world that we want to explain, the basic facts that economists
purport to elucidate. At the other level, they are the source of all our trouble. Their
imperfections make our job difficult and often impossible... We tend to forget that
these imperfections are what gives us our legitimacy in the first place... Given that
it is the ‘badness’ of the data that provides us with our living, perhaps it is not all
that surprising that we have shown little interest in improving it, in getting involved
in the grubby task of designing and collecting original data sets of our own. Most of
our work is on ‘found’ data, data that have been collected by somebody else, often
for quite different purposes.

Even though economists are increasingly involved in collecting their own data and measuring
variables more accurately, and despite the growth in data sets, data storage and computational
accuracy, some of the warnings given by Griliches (1986, p. 1468) are still valid today:

The encounters between econometricians and data are frustrating and ultimately
unsatisfactory both because econometricians want too much from the data and hence
tend to be disappointed by the answers, and because the data are incomplete and
imperfect. In part it is our fault, the appetite grows with eating. As we get larger
samples, we keep adding variables and expanding our models, until on the margin,
we come back to the same insignificance levels.


1.2  A  Brief  History 

For  a  brief  review  of  the  origins  of  econometrics  before  World  War  II  and  its  development  in  the 
1940-1970  period,  see  Klein  (1971).  Klein  gives  an  interesting  account  of  the  pioneering  works 
of  Moore  (1914)  on  economic  cycles,  Working  (1927)  on  demand  curves,  Cobb  and  Douglas 
(1928)  on  the  theory  of  production,  Schultz  (1938)  on  the  theory  and  measurement  of  demand, 
and  Tinbergen  (1939)  on  business  cycles.  As  Klein  (1971,  p.  415)  adds: 

The works of these men mark the beginnings of formal econometrics. Their analysis
was  systematic,  based  on  the  joint  foundations  of  statistical  and  economic  theory, 
and  they  were  aiming  at  meaningful  substantive  goals  -  to  measure  demand  elasticity, 
marginal  productivity  and  the  degree  of  macroeconomic  stability. 

The  story  of  the  early  progress  in  estimating  economic  relationships  in  the  U.S.  is  given  in  Christ 
(1985).  The  modern  era  of  econometrics,  as  we  know  it  today,  started  in  the  1940’s.  Klein  (1971) 
attributes  the  formulation  of  the  econometrics  problem  in  terms  of  the  theory  of  statistical 
inference  to  Haavelmo  (1943,  1944)  and  Mann  and  Wald  (1943).  This  work  was  extended  later  by 
T.C.  Koopmans,  J.  Marschak,  L.  Hurwicz,  T.W.  Anderson  and  others  at  the  Cowles  Commission 
in  the  late  1940’s  and  early  1950’s,  see  Koopmans  (1950).  Klein  (1971,  p.  416)  adds: 

At this time econometrics and mathematical economics had to fight for academic
recognition. In retrospect, it is evident that they were growing disciplines and becoming
increasingly attractive to the new generation of economic students after World
War II, but only a few of the largest and most advanced universities offered formal
work in these subjects. The mathematization of economics was strongly resisted.

This resistance is a thing of the past, with econometrics being an integral part of economics,
taught and practiced worldwide. Econometrica, the official journal of the Econometric Society,
is one of the leading journals in economics, and today the Econometric Society boasts a large
membership worldwide. Today, it is hard to read any professional article in leading economics
and econometrics journals without seeing mathematical equations. Students of economics and
econometrics have to be proficient in mathematics to comprehend this research. In an Econometric
Theory interview, Professor J. D. Sargan of the London School of Economics looks back
at his own career in econometrics and makes the following observations: “... econometric theorists
have really got to be much more professional statistical theorists than they had to be when
I started out in econometrics in 1948... Of course this means that the starting econometrician
hoping to do a Ph.D. in this field is also finding it more difficult to digest the literature as a
prerequisite for his own study, and perhaps we need to attract students of an increasing degree
of mathematical and statistical sophistication into our field as time goes by,” see Phillips
(1985, pp. 134-135). This is also echoed by another giant in the field, Professor T.W. Anderson
of Stanford, who said in an Econometric Theory interview: “These days econometricians are
very highly trained in mathematics and statistics; much more so than statisticians are trained
in economics; and I think that there will be more cross-fertilization, more joint activity,” see
Phillips (1986, p. 280).

Research at the Cowles Commission was responsible for providing formal solutions to the
problems of identification and estimation of the simultaneous equations model, see Christ
(1985).² Two important monographs summarizing much of the work of the Cowles Commission
at Chicago are Koopmans and Marschak (1950) and Koopmans and Hood (1953).³ The
creation of large data banks of economic statistics, advances in computing, and the general
acceptance of Keynesian theory were responsible for a great flurry of activity in econometrics.
Macroeconometric modelling started to flourish beyond the pioneering macro models of Klein
(1950) and Klein and Goldberger (1955).

For  the  story  of  the  founding  of  Econometrica  and  the  Econometric  Society,  see  Christ  (1983). 
Suggested  readings  on  the  history  of  econometrics  are  Pesaran  (1987),  Epstein  (1987)  and 
Morgan (1990). In the conclusion of her book on The History of Econometric Ideas, Morgan
(1990;  p.  264)  explains: 

In the first half of the twentieth century, econometricians found themselves carrying
out a wide range of tasks: from the precise mathematical formulation of economic
theories to the development tasks needed to build an econometric model; from the application
of statistical methods in data preparation to the measurement and testing
of models. Of necessity, econometricians were deeply involved in the creative development
of both mathematical economic theory and statistical theory and techniques.
Between the 1920s and the 1940s, the tools of mathematics and statistics were indeed
used in a productive and complementary union to forge the essential ideas of the
econometric approach. But the changing nature of the econometric enterprise in the
1940s caused a return to the division of labour favoured in the late nineteenth century,
with mathematical economists working on theory building and econometricians
concerned with statistical work. By the 1950s the founding ideal of econometrics, the
union of mathematical and statistical economics into a truly synthetic economics,
had collapsed.

In modern day usage, econometrics has become the application of statistical methods to economics,
like biometrics and psychometrics. Although the ideals of Frisch still live on in Econometrica
and the Econometric Society, Maddala (1999) argues that: “In recent years the issues
of Econometrica have had only a couple of papers in econometrics (statistical methods in economics)
and the rest are all on game theory and mathematical economics. If you look at the
list of fellows of the Econometric Society, you find one or two econometricians and the rest
are game theorists and mathematical economists.” This may be a little exaggerated but it does
summarize the rift between modern day econometrics and mathematical economics. For a worldwide
ranking of econometricians as well as academic institutions in the field of econometrics,
see Baltagi (2007).


1.3  Critiques  of  Econometrics 

Econometrics  has  its  critics.  Interestingly,  John  Maynard  Keynes  (1940,  p.  156)  had  the  following 
to  say  about  Jan  Tinbergen’s  (1939)  pioneering  work: 

No  one  could  be  more  frank,  more  painstaking,  more  free  of  subjective  bias  or  parti 
pris  than  Professor  Tinbergen.  There  is  no  one,  therefore,  so  far  as  human  qualities 
go,  whom  it  would  be  safer  to  trust  with  black  magic.  That  there  is  anyone  I  would 
trust  with  it  at  the  present  stage  or  that  this  brand  of  statistical  alchemy  is  ripe  to 
become  a  branch  of  science,  I  am  not  yet  persuaded.  But  Newton,  Boyle  and  Locke 
all played with alchemy. So let him continue.⁴

In  1969,  Jan  Tinbergen  shared  the  first  Nobel  Prize  in  economics  with  Ragnar  Frisch. 

Well-cited critiques of econometrics include the Lucas (1976) critique, which is based on the
Rational Expectations Hypothesis (REH). As Pesaran (1990, p. 17) puts it:

The  message  of  the  REH  for  econometrics  was  clear.  By  postulating  that  economic 
agents  form  their  expectations  endogenously  on  the  basis  of  the  true  model  of  the 
economy and a correct understanding of the processes generating exogenous variables
of  the  model,  including  government  policy,  the  REH  raised  serious  doubts  about  the 
invariance  of  the  structural  parameters  of  the  mainstream  macroeconometric  models 
in  face  of  changes  in  government  policy. 

Responses  to  this  critique  include  Pesaran  (1987).  Other  lively  debates  among  econometricians 
include Ed Leamer’s (1983) article entitled “Let’s Take the Con Out of Econometrics,” and the
response  by  McAleer,  Pagan  and  Volker  (1985).  Rather  than  leave  the  reader  with  criticisms 
of  econometrics  especially  before  we  embark  on  the  journey  to  learn  the  tools  of  the  trade,  we 
conclude  this  section  with  the  following  quote  from  Pesaran  (1990,  pp.  25-26): 

There is no doubt that econometrics is subject to important limitations, which stem
largely from the incompleteness of the economic theory and the non-experimental
nature of economic data. But these limitations should not distract us from recognizing
the fundamental role that econometrics has come to play in the development
of economics as a scientific discipline. It may not be possible conclusively to reject
economic theories by means of econometric methods, but it does not mean that
nothing useful can be learned from attempts at testing particular formulations of a
given theory against (possible) rival alternatives. Similarly, the fact that econometric
modelling is inevitably subject to the problem of specification searches does not
mean that the whole activity is pointless. Econometric models are important tools for
forecasting and policy analysis, and it is unlikely that they will be discarded in the
future. The challenge is to recognize their limitations and to work towards turning
them into more reliable and effective tools. There seem to be no viable alternatives.


1.4  Looking  Ahead 

Econometrics has experienced phenomenal growth in the past 50 years. There are six volumes
of the Handbook of Econometrics, most of them dealing with post-1960s research. A lot of the
recent growth reflects the rapid advances in computing technology. The broad availability of
micro databases is a major advance which facilitated the growth of panel data methods (see
Chapter 12) and microeconometric methods, especially on sample selection and discrete choice
(see Chapter 13), and that also led to the award of the Nobel Prize in Economics to James
Heckman and Daniel McFadden in 2000. The explosion in research in time series econometrics
led to the development of ARCH and GARCH and cointegration (see Chapter 14),
which in turn led to the award of the Nobel Prize in Economics to Clive Granger and Robert
Engle in 2003. It is a different world than it was 30 years ago. Computing facilities have changed
dramatically. The increasing accessibility of cheap and powerful computing facilities is helping
to make the latest econometric methods more readily available to applied researchers. Today,
there is hardly a field in economics which has not been intensive in its use of econometrics in
empirical work. Pagan (1987, p. 81) observed that the work of econometric theorists over the
period 1966-1986 has become part of the process of economic investigation and the training
of economists. Based on this criterion, he declares econometrics an “outstanding success.”
He adds that:

The judging of achievement inevitably involves contrast and comparison. Over a
period of twenty years this would be best done by interviewing a time-travelling
economist displaced from 1966 to 1986. I came into econometrics just after the beginning
of this period, so have some appreciation for what has occurred. But because
I have seen the events gradually unfolding, the effects upon me are not as dramatic.
Nevertheless, let me try to be a time-traveller and comment on the perceptions of a
1966’er landing in 1986. My first impression must be of the large number of people
who have enough econometric and computer skills to formulate, estimate and simulate
highly complex and non-linear models. Someone who could do the equivalent
tasks in 1966 was well on the way to a Chair. My next impression would be of the
widespread use and purchase of econometric services in the academic, government,
and private sectors. Quantification is now the norm rather than the exception. A
third impression, gleaned from a sounding of the job market, would be a persistent
tendency towards an excess demand for well-trained econometricians. The economist
in me would have to acknowledge that the market judges the products of the discipline
as a success.

The challenge for the 21st century is to narrow the gap between theory and practice. Many
feel that this gap has been widening with theoretical research growing more and more abstract
and highly mathematical without an application in sight or a motivation for practical use.
Heckman (2001) argues that econometrics is useful only if it helps economists conduct and
interpret empirical research on economic data. He warns that the gap between econometric
theory and empirical practice has grown over the past two decades, with theoretical econometrics
becoming more closely tied to mathematical statistics. Although he finds nothing wrong, and
much potential value, in using methods and ideas from other fields to improve empirical work
in economics, he does warn of the risks involved in uncritically adopting the methods and mindset
of the statisticians:

Econometric methods uncritically adapted from statistics are not useful in many research
activities pursued by economists. A theorem-proof format is poorly suited for
analyzing economic data, which requires skills of synthesis, interpretation and empirical
investigation. Command of statistical methods is only a part, and sometimes
a very small part, of what is required to do first class empirical research.

In  an  Econometric  Theory  interview  with  Jan  Tinbergen,  Magnus  and  Morgan  (1987,  p.  117) 
describe  Tinbergen  as  one  of  the  founding  fathers  of  econometrics,  publishing  in  the  field  from 
1927  until  the  early  1950s.  They  add:  “Tinbergen’s  approach  to  economics  has  always  been  a 
practical  one.  This  was  highly  appropriate  for  the  new  field  of  econometrics,  and  enabled  him 
to  make  important  contributions  to  conceptual  and  theoretical  issues,  but  always  in  the  context 
of  a  relevant  economic  problem.”  The  founding  fathers  of  econometrics  have  always  had  the 
practitioner  in  sight.  This  is  a  far  cry  from  many  theoretical  econometricians  who  refrain  from 
applied  work. 

Geweke,  Horowitz,  and  Pesaran  (2008)  provide  the  following  recommendations  for  the  future: 

Econometric  theory  and  practice  seek  to  provide  information  required  for  informed 
decision-making  in  public  and  private  economic  policy.  This  process  is  limited  not 
only  by  the  adequacy  of  econometrics,  but  also  by  the  development  of  economic  theory 
and  the  adequacy  of  data  and  other  information.  Effective  progress,  in  the  future  as 
in  the  past,  will  come  from  simultaneous  improvements  in  econometrics,  economic 
theory, and data. Research that specifically addresses the effectiveness of the interface
between  any  two  of  these  three  in  improving  policy  —  to  say  nothing  of  all  of  them 
-  necessarily  transcends  traditional  subdisciplinary  boundaries  within  economics. 
But  it  is  precisely  these  combinations  that  hold  the  greatest  promise  for  the  social 
contribution  of  academic  economics. 


Notes 

1. See the interview of Professor L.R. Klein by Mariano (1987). Econometric Theory publishes interviews
with some of the giants in the field. These interviews offer a wonderful glimpse at the life
and work of these giants.

2. The simultaneous equations model is an integral part of econometrics and is studied in Chapter 11.

3.  Tjalling  Koopmans  was  the  joint  recipient  of  the  Nobel  Prize  in  Economics  in  1975.  In  addition 
to  his  work  on  the  identification  and  estimation  of  simultaneous  equations  models,  he  received  the 
Nobel  Prize  for  his  work  in  optimization  and  economic  theory. 

4.  I  encountered  this  attack  by  Keynes  on  Tinbergen  in  the  inaugural  lecture  that  Peter  C.B.  Phillips 
(1977)  gave  at  the  University  of  Birmingham  entitled  “Econometrics:  A  View  From  the  Toolroom,” 
and  David  F.  Hendry’s  (1980)  article  entitled  “Econometrics  -  Alchemy  or  Science?” 
CHAPTER 2

Basic  Statistical  Concepts 

2.1  Introduction 

One chapter cannot possibly review what one learned in one or two prerequisite courses in
statistics. This is an econometrics book, and it is imperative that the student have taken at
least one solid course in statistics. The concepts of a random variable, whether discrete or continuous,
and the associated probability function or probability density function (p.d.f.) are assumed
known. Similarly, the reader should know the following statistical terms: cumulative distribution
function, marginal, conditional and joint p.d.f.’s. The reader should be comfortable with
computing mathematical expectations, and familiar with the concepts of independence, Bayes’
Theorem and several continuous and discrete probability distributions. These distributions include:
the Bernoulli, Binomial, Poisson, Geometric, Uniform, Normal, Gamma, Chi-squared
($\chi^2$), Exponential, Beta, t and F distributions.

Section  2.2  reviews  two  methods  of  estimation,  while  section  2.3  reviews  the  properties  of 
the resulting estimators. Section 2.4 gives a brief review of tests of hypotheses, while section 2.5
discusses  the  meaning  of  confidence  intervals.  These  sections  are  fundamental  background  for 
this  book,  and  the  reader  should  make  sure  that  he  or  she  is  familiar  with  these  concepts.  Also, 
be  sure  to  solve  the  exercises  at  the  end  of  this  chapter. 


2.2  Methods  of  Estimation 

Consider a Normal distribution with mean $\mu$ and variance $\sigma^2$. This is the important “Gaussian”
distribution which is symmetric and bell-shaped and completely determined by its measure
of centrality, its mean $\mu$, and its measure of dispersion, its variance $\sigma^2$. $\mu$ and $\sigma^2$ are called
the population parameters. Draw a random sample $X_1, \ldots, X_n$ independent and identically
distributed (IID) from this population. We usually estimate $\mu$ by $\hat{\mu} = \bar{X}$ and $\sigma^2$ by

$$s^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2/(n-1).$$

For example, $\mu$ = mean income of a household in Houston. $\bar{X}$ = sample average of incomes of
100 households randomly interviewed in Houston.

This estimator of $\mu$ could have been obtained by either of the following two methods of
estimation:


(i)  Method  of  Moments 

Simply  stated,  this  method  of  estimation  uses  the  following  rule:  Keep  equating  population 
moments  to  their  sample  counterpart  until  you  have  estimated  all  the  population  parameters. 


B.H. Baltagi, Econometrics, Springer Texts in Business and Economics, DOI 10.1007/978-3-642-20059-5_2,
© Springer-Verlag Berlin Heidelberg 2011
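$$
\begin{array}{ll}
\text{Population} & \text{Sample} \\
E(X) = \mu & \sum_{i=1}^{n} X_i/n = \bar{X} \\
E(X^2) = \mu^2 + \sigma^2 & \sum_{i=1}^{n} X_i^2/n \\
E(X^r) & \sum_{i=1}^{n} X_i^r/n
\end{array}
$$

The normal density is completely identified by $\mu$ and $\sigma^2$, hence only the first 2 equations are
needed

$$\hat{\mu} = \bar{X} \quad \text{and} \quad \hat{\mu}^2 + \hat{\sigma}^2 = \sum_{i=1}^{n} X_i^2/n$$

Substituting the first equation in the second one obtains

$$\hat{\sigma}^2 = \sum_{i=1}^{n} X_i^2/n - \bar{X}^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2/n$$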
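The two moment equations above can be sketched numerically. The following is only an illustration under stated assumptions: the function name `method_of_moments_normal` and the simulated data (100 draws with mean 50 and standard deviation 10) are hypothetical and do not come from the text.

```python
import random


def method_of_moments_normal(sample):
    """Return (mu_hat, sigma2_hat) by equating the first two moments."""
    n = len(sample)
    m1 = sum(sample) / n                  # first sample moment: average of X_i
    m2 = sum(x * x for x in sample) / n   # second sample moment: average of X_i^2
    mu_hat = m1                           # from E(X) = mu
    sigma2_hat = m2 - m1 ** 2             # from E(X^2) = mu^2 + sigma^2
    return mu_hat, sigma2_hat


# Hypothetical data: 100 simulated draws, not taken from the text.
rng = random.Random(0)
sample = [rng.gauss(50.0, 10.0) for _ in range(100)]
mu_hat, sigma2_hat = method_of_moments_normal(sample)

# The same estimate written as the sum of squared deviations divided by n.
direct = sum((x - mu_hat) ** 2 for x in sample) / len(sample)
print(mu_hat, sigma2_hat, direct)
```

The last two printed numbers should agree up to rounding error, which is the substitution step carried out above.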

(ii)  Maximum  Likelihood  Estimation  (MLE) 

For a random sample of size $n$ from the Normal distribution $X_i \sim N(\mu, \sigma^2)$, we have

$$f_i(X_i; \mu, \sigma^2) = (1/\sigma\sqrt{2\pi}) \exp\{-(X_i - \mu)^2/2\sigma^2\} \qquad -\infty < X_i < +\infty$$

Since $X_1, \ldots, X_n$ are independent and identically distributed, the joint probability density function
is given as the product of the marginal probability density functions:

$$f(X_1, \ldots, X_n; \mu, \sigma^2) = \prod_{i=1}^{n} f_i(X_i; \mu, \sigma^2) = (1/2\pi\sigma^2)^{n/2} \exp\{-\sum_{i=1}^{n} (X_i - \mu)^2/2\sigma^2\} \tag{2.1}$$

Usually, we observe only one sample of $n$ households which could have been generated by any
pair of $(\mu, \sigma^2)$ with $-\infty < \mu < +\infty$ and $\sigma^2 > 0$. For each pair, say $(\mu_0, \sigma_0^2)$, $f(X_1, \ldots, X_n; \mu_0, \sigma_0^2)$
denotes the probability (or likelihood) of obtaining that sample. By varying $(\mu, \sigma^2)$ we get different
probabilities of obtaining this sample. Intuitively, we choose the values of $\mu$ and $\sigma^2$ that maximize
the probability of obtaining this sample. Mathematically, we treat $f(X_1, \ldots, X_n; \mu, \sigma^2)$ as
$L(\mu, \sigma^2)$ and we call it the likelihood function. Maximizing $L(\mu, \sigma^2)$ with respect to $\mu$ and $\sigma^2$,
one gets the first-order conditions of maximization:

$$(\partial L/\partial \mu) = 0 \quad \text{and} \quad (\partial L/\partial \sigma^2) = 0$$

Equivalently, we can maximize $\log L(\mu, \sigma^2)$ rather than $L(\mu, \sigma^2)$ and still get the same answer.
Usually, this monotonic transformation of the likelihood is easier to maximize and the
first-order conditions become

$$(\partial \log L/\partial \mu) = 0 \quad \text{and} \quad (\partial \log L/\partial \sigma^2) = 0$$

For the Normal distribution example, we get

$$\log L(\mu; \sigma^2) = -(n/2)\log \sigma^2 - (n/2)\log 2\pi - (1/2\sigma^2) \sum_{i=1}^{n} (X_i - \mu)^2$$

$$\partial \log L(\mu; \sigma^2)/\partial \mu = (1/\sigma^2) \sum_{i=1}^{n} (X_i - \mu) = 0 \;\Rightarrow\; \hat{\mu}_{MLE} = \bar{X}$$

$$\partial \log L(\mu; \sigma^2)/\partial \sigma^2 = -(n/2)(1/\sigma^2) + \sum_{i=1}^{n} (X_i - \mu)^2/2\sigma^4 = 0$$

$$\Rightarrow\; \hat{\sigma}^2_{MLE} = \sum_{i=1}^{n} (X_i - \hat{\mu}_{MLE})^2/n = \sum_{i=1}^{n} (X_i - \bar{X})^2/n$$
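As a rough numerical check of these first-order conditions, one can evaluate the log-likelihood at the closed-form solution and at nearby parameter values. This is a minimal sketch, assuming simulated data; the function names `log_likelihood` and `normal_mle` and the chosen perturbations are illustrative only.

```python
import math
import random


def log_likelihood(sample, mu, sigma2):
    """Normal log-likelihood, i.e. the log of the joint density in (2.1)."""
    n = len(sample)
    ss = sum((x - mu) ** 2 for x in sample)
    return (-(n / 2) * math.log(sigma2)
            - (n / 2) * math.log(2 * math.pi)
            - ss / (2 * sigma2))


def normal_mle(sample):
    """Closed-form solution of the two first-order conditions."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n  # divides by n, not n - 1
    return mu_hat, sigma2_hat


# Hypothetical data: 200 simulated draws with mean 2 and standard deviation 3.
rng = random.Random(1)
sample = [rng.gauss(2.0, 3.0) for _ in range(200)]
mu_hat, sigma2_hat = normal_mle(sample)
best = log_likelihood(sample, mu_hat, sigma2_hat)

# Moving either parameter away from the MLE lowers the log-likelihood.
for d_mu, d_s2 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    assert log_likelihood(sample, mu_hat + d_mu, sigma2_hat + d_s2) < best
print(mu_hat, sigma2_hat, best)
```

A general-purpose optimizer could replace the closed form, but for the Normal case the analytical solution is available and is what the derivation delivers.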
Note that the moments estimators and the maximum likelihood estimators are the same for
the Normal distribution example. In general, the two methods need not necessarily give the
same estimators. Also, note that the moments estimators will always have the same estimating
equations, for example, the first two equations are always

$$E(X) = \mu = \sum_{i=1}^{n} X_i/n = \bar{X} \quad \text{and} \quad E(X^2) = \mu^2 + \sigma^2 = \sum_{i=1}^{n} X_i^2/n.$$

For a specific distribution, we need only substitute the relationship between the population
moments and the parameters of that distribution. Again, the number of equations needed
depends upon the number of parameters of the underlying distribution. For example, the exponential
distribution has one parameter and needs only one equation whereas the gamma distribution has
two parameters and needs two equations. Finally, note that the maximum likelihood technique
is heavily reliant on the form of the underlying distribution, but it has desirable properties when
it exists. These properties will be discussed in the next section.

So far we have dealt with the Normal distribution to illustrate the two methods of estimation.
We now apply these methods to the Bernoulli distribution and leave applications to other
distributions to the exercises. We urge the student to practice on these exercises.

Bernoulli Example: In various cases in real life the outcome of an event is binary: a worker may
join the labor force or may not. A criminal may return to crime after parole or may not. A
television off the assembly line may be defective or not. A coin tossed comes up head or tail,
and so on. In this case $\theta = \Pr[\text{Head}]$ and $1 - \theta = \Pr[\text{Tail}]$ with $0 < \theta < 1$ and this can be
represented by the discrete probability function

$$f(X; \theta) = \begin{cases} \theta^{X}(1-\theta)^{1-X} & X = 0, 1 \\ 0 & \text{elsewhere} \end{cases}$$

The Normal distribution is a continuous distribution since it takes values for all $X$ over the real
line. The Bernoulli distribution is discrete, because it is defined only at integer values for $X$.
Note that $P[X = 1] = f(1; \theta) = \theta$ and $P[X = 0] = f(0; \theta) = 1 - \theta$ for all values of $0 < \theta < 1$.
A random sample of size $n$ drawn from this distribution will have a joint probability function

$$L(\theta) = f(X_1, \ldots, X_n; \theta) = \theta^{\sum_{i=1}^{n} X_i}(1-\theta)^{n - \sum_{i=1}^{n} X_i}$$

with $X_i = 0, 1$ for $i = 1, \ldots, n$. Therefore,

$$\log L(\theta) = \left(\sum_{i=1}^{n} X_i\right)\log\theta + \left(n - \sum_{i=1}^{n} X_i\right)\log(1-\theta)$$

$$\frac{\partial \log L(\theta)}{\partial \theta} = \frac{\sum_{i=1}^{n} X_i}{\theta} - \frac{\left(n - \sum_{i=1}^{n} X_i\right)}{(1-\theta)}$$

Setting this derivative equal to zero and solving the first-order condition for $\theta$, one gets

$$\left(\sum_{i=1}^{n} X_i\right)(1-\theta) - \theta\left(n - \sum_{i=1}^{n} X_i\right) = 0$$

which reduces to

$$\hat{\theta}_{MLE} = \sum_{i=1}^{n} X_i/n = \bar{X}.$$

This is the frequency of heads in $n$ tosses of a coin.
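The closed-form result can be compared with a brute-force maximization of the Bernoulli log-likelihood over a grid of candidate values. The sketch below assumes a hypothetical outcome of 37 heads in 100 tosses; the helper names and the grid size are illustrative choices, not part of the text.

```python
import math


def bernoulli_log_likelihood(theta, successes, n):
    """log L(theta) = (sum X_i) log(theta) + (n - sum X_i) log(1 - theta)."""
    return successes * math.log(theta) + (n - successes) * math.log(1 - theta)


def grid_mle(successes, n, steps=1000):
    """Return the theta on an evenly spaced grid in (0, 1) with the highest log-likelihood."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda t: bernoulli_log_likelihood(t, successes, n))


# Hypothetical data: 37 heads (coded 1) in 100 tosses.
tosses = [1] * 37 + [0] * 63
theta_hat = sum(tosses) / len(tosses)            # closed form: the sample mean
theta_grid = grid_mle(sum(tosses), len(tosses))  # numerical check on the grid
print(theta_hat, theta_grid)
```

Because the sample mean happens to lie on the grid here, the two values coincide; in general the grid answer is only as fine as the grid spacing.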
For the method of moments, we need

$$E(X) = \sum_{X=0}^{1} X f(X, \theta) = 1 \cdot f(1, \theta) + 0 \cdot f(0, \theta) = f(1, \theta) = \theta$$

and this is equated to $\bar{X}$ to get $\hat{\theta} = \bar{X}$. Once again, the MLE and the method of moments yield
the same estimator. Note that only one parameter $\theta$ characterizes this Bernoulli distribution
and one does not need to equate second or higher population moments to their sample values.


2.3  Properties  of  Estimators 

(i)  Unbiasedness 

$\hat{\mu}$ is said to be unbiased for $\mu$ if and only if $E(\hat{\mu}) = \mu$.

For $\hat{\mu} = \bar{X}$, we have $E(\bar{X}) = \sum_{i=1}^{n} E(X_i)/n = \mu$ and $\bar{X}$ is unbiased for $\mu$. No distributional
assumption is needed as long as the $X_i$’s are distributed with the same mean $\mu$. Unbiasedness
means that “on the average” our estimator is on target. Let us explain this last statement. If
we repeat our drawing of a random sample of 100 households, say 200 times, then we get 200
$\bar{X}$’s. Some of these $\bar{X}$’s will be above $\mu$, some below $\mu$, but their average should be very close
to $\mu$. Since in real life situations, we observe only one random sample, there is little consolation
if our observed $\bar{X}$ is far from $\mu$. But the larger $n$ is the smaller is the dispersion of this $\bar{X}$, since
var$(\bar{X}) = \sigma^2/n$ and the lesser is the likelihood of this $\bar{X}$ to be very far from $\mu$. This leads us to
the concept of efficiency.
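The repeated-sampling thought experiment can be mimicked on a computer. The following sketch assumes Normal incomes with mean 50 and standard deviation 10 (hypothetical values chosen for illustration) and repeats the drawing of 100 households 200 times.

```python
import random
import statistics


def sample_means(mu, sigma, n, replications, seed=0):
    """Draw `replications` independent samples of size n and return their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(replications):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
    return means


# 200 repetitions of a sample of 100 households (population values are assumptions).
means = sample_means(mu=50.0, sigma=10.0, n=100, replications=200)
average_of_means = sum(means) / len(means)     # should be close to mu = 50
spread_of_means = statistics.pvariance(means)  # should be near sigma^2 / n = 1
print(average_of_means, spread_of_means)
```

Individual sample means scatter around 50, but their average is close to it, and their dispersion is roughly $\sigma^2/n$.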


(ii)  Efficiency 

For two unbiased estimators, we compare their efficiencies by the ratio of their variances. We say
that the one with lower variance is more efficient. For example, taking $\hat{\mu}_1 = X_1$ versus $\hat{\mu}_2 = \bar{X}$,
both estimators are unbiased but var$(\hat{\mu}_1) = \sigma^2$ whereas var$(\hat{\mu}_2) = \sigma^2/n$ and {the relative
efficiency of $\hat{\mu}_1$ with respect to $\hat{\mu}_2$} = var$(\hat{\mu}_2)$/var$(\hat{\mu}_1) = 1/n$, see Figure 2.1. To compare all
unbiased estimators, we find the one with minimum variance. Such an estimator, if it exists, is
called the MVU (minimum variance unbiased) estimator. A lower bound for the variance of
any unbiased estimator $\hat{\mu}$ of $\mu$ is known in the statistical literature as the Cramer-Rao lower
bound, and is given by

$$\text{var}(\hat{\mu}) \geq 1/[nE\{(\partial \log f(X; \mu)/\partial \mu)^2\}] = -1/\{nE(\partial^2 \log f(X; \mu)/\partial \mu^2)\} \tag{2.2}$$

where we use either representation of the bound on the right hand side of (2.2) depending on
which one is simpler to derive.
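A small simulation can illustrate the relative efficiency of the first observation versus the sample mean. This is a sketch under assumed values ($\mu = 0$, $\sigma = 2$, $n = 25$, 4000 replications); the function name `compare_estimators` is hypothetical.

```python
import random
import statistics


def compare_estimators(mu, sigma, n, replications, seed=0):
    """Simulated variances of mu_hat_1 = X_1 and mu_hat_2 = X-bar over repeated samples."""
    rng = random.Random(seed)
    first_obs, means = [], []
    for _ in range(replications):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        first_obs.append(sample[0])    # estimator 1: the first observation only
        means.append(sum(sample) / n)  # estimator 2: the sample mean
    return statistics.pvariance(first_obs), statistics.pvariance(means)


var1, var2 = compare_estimators(mu=0.0, sigma=2.0, n=25, replications=4000)
relative_efficiency = var2 / var1  # theory: 1/n = 0.04
print(var1, var2, relative_efficiency)
```

Both estimators center on $\mu$, but the simulated variance of the sample mean is roughly $1/n$ times that of the single observation.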

Example 1: Consider the normal density

$$\log f(X_i; \mu) = (-1/2)\log \sigma^2 - (1/2)\log 2\pi - (1/2)(X_i - \mu)^2/\sigma^2$$

$$\partial \log f(X_i; \mu)/\partial \mu = (X_i - \mu)/\sigma^2$$

$$\partial^2 \log f(X_i; \mu)/\partial \mu^2 = -(1/\sigma^2)$$

with $E\{\partial^2 \log f(X_i; \mu)/\partial \mu^2\} = -(1/\sigma^2)$. Therefore, the variance of any unbiased estimator of $\mu$,
say $\hat{\mu}$, satisfies the property that var$(\hat{\mu}) \geq \sigma^2/n$.

Figure 2.1 Efficiency Comparisons

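Turning to $\sigma^2$; let $\theta = \sigma^2$, then

$$\log f(X_i; \theta) = -(1/2)\log \theta - (1/2)\log 2\pi - (1/2)(X_i - \mu)^2/\theta$$

$$\partial \log f(X_i; \theta)/\partial \theta = -1/2\theta + (X_i - \mu)^2/2\theta^2 = \{(X_i - \mu)^2 - \theta\}/2\theta^2$$

$$\partial^2 \log f(X_i; \theta)/\partial \theta^2 = 1/2\theta^2 - (X_i - \mu)^2/\theta^3 = \{\theta - 2(X_i - \mu)^2\}/2\theta^3$$

$E[\partial^2 \log f(X_i; \theta)/\partial \theta^2] = -(1/2\theta^2)$, since $E(X_i - \mu)^2 = \theta$. Hence, for any unbiased estimator of
$\theta$, say $\hat{\theta}$, its variance satisfies the following property var$(\hat{\theta}) \geq 2\theta^2/n$, or var$(\hat{\sigma}^2) \geq 2\sigma^4/n$.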
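The two representations of the bound can be compared by simulation for the variance parameter. The sketch below, with hypothetical values $\mu = 1$ and $\theta = \sigma^2 = 4$, averages the squared first derivative and the negative second derivative of $\log f$ over many Normal draws; both should be close to $1/2\theta^2$.

```python
import random


def information_two_ways(mu, theta, draws, seed=0):
    """Monte Carlo estimates of E[(dlogf/dtheta)^2] and -E[d2logf/dtheta2], theta = sigma^2."""
    rng = random.Random(seed)
    sigma = theta ** 0.5
    sum_score_sq = 0.0
    sum_second = 0.0
    for _ in range(draws):
        dev2 = (rng.gauss(mu, sigma) - mu) ** 2
        score = (dev2 - theta) / (2 * theta ** 2)       # first derivative of log f
        second = (theta - 2 * dev2) / (2 * theta ** 3)  # second derivative of log f
        sum_score_sq += score ** 2
        sum_second += second
    return sum_score_sq / draws, -sum_second / draws


theta = 4.0
info_score, info_second = information_two_ways(mu=1.0, theta=theta, draws=100_000)
exact_info = 1 / (2 * theta ** 2)  # per-observation information; bound is 2 theta^2 / n
print(info_score, info_second, exact_info)
```

In this example the second-derivative form is the less noisy of the two, which mirrors the remark that one uses whichever representation is easier to work with.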

Note that, if one finds an unbiased estimator whose variance attains the Cramer-Rao lower
bound, then this is the MVU estimator. It is important to remember that this is only a lower
bound and sometimes it is not necessarily attained. If the $X_i$’s are normal, $\bar{X} \sim N(\mu, \sigma^2/n)$.
Hence, $\bar{X}$ is unbiased for $\mu$ with variance $\sigma^2/n$ equal to the Cramer-Rao lower bound. Therefore,
$\bar{X}$ is MVU for $\mu$. On the other hand,

$$\hat{\sigma}^2_{MLE} = \sum_{i=1}^{n} (X_i - \bar{X})^2/n,$$

and it can be shown that $(n\hat{\sigma}^2_{MLE})/(n-1) = s^2$ is unbiased for $\sigma^2$. In fact, $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$
and the expected value of a Chi-squared variable with $(n-1)$ degrees of freedom is exactly its
degrees of freedom. Using this fact,

$$E\{(n-1)s^2/\sigma^2\} = E(\chi^2_{n-1}) = n-1.$$

Therefore, $E(s^2) = \sigma^2$.¹ Also, the variance of a Chi-squared variable with $(n-1)$ degrees of
freedom is twice these degrees of freedom. Using this fact,

$$\text{var}\{(n-1)s^2/\sigma^2\} = \text{var}(\chi^2_{n-1}) = 2(n-1)$$

or

$$\{(n-1)^2/\sigma^4\}\text{var}(s^2) = 2(n-1).$$

Hence, var$(s^2) = 2\sigma^4/(n-1)$ and this does not attain the Cramer-Rao lower bound. In fact,
it is larger than $(2\sigma^4/n)$. Note also that var$(\hat{\sigma}^2_{MLE}) = \{(n-1)^2/n^2\}$var$(s^2) = \{2(n-1)\}\sigma^4/n^2$.
This is smaller than $(2\sigma^4/n)$! How can that be? Remember that $\hat{\sigma}^2_{MLE}$ is a biased estimator
of $\sigma^2$ and hence, var$(\hat{\sigma}^2_{MLE})$ should not be compared with the Cramer-Rao lower bound. This
lower bound pertains only to unbiased estimators.
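These variance comparisons can be illustrated by simulation. The sketch below assumes $\sigma^2 = 1$, $n = 10$ and 20,000 replications (all hypothetical choices) and computes both $s^2$ and $\hat{\sigma}^2_{MLE}$ on each simulated sample.

```python
import random
import statistics


def variance_estimators(sigma2, n, replications, seed=0):
    """Simulate s^2 (divide by n - 1) and the MLE (divide by n) over repeated Normal samples."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    s2_draws, mle_draws = [], []
    for _ in range(replications):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        ss = sum((x - xbar) ** 2 for x in sample)
        s2_draws.append(ss / (n - 1))
        mle_draws.append(ss / n)
    return s2_draws, mle_draws


sigma2, n = 1.0, 10
s2_draws, mle_draws = variance_estimators(sigma2, n, replications=20_000)
cramer_rao = 2 * sigma2 ** 2 / n          # bound for unbiased estimators: 0.2
mean_s2 = statistics.fmean(s2_draws)      # theory: sigma^2 = 1
mean_mle = statistics.fmean(mle_draws)    # theory: (n - 1) sigma^2 / n = 0.9
var_s2 = statistics.pvariance(s2_draws)   # theory: 2 sigma^4 / (n - 1), about 0.222
var_mle = statistics.pvariance(mle_draws) # theory: 2 (n - 1) sigma^4 / n^2 = 0.18
print(mean_s2, mean_mle, var_s2, var_mle, cramer_rao)
```

The simulated variance of the MLE falls below the bound only because the MLE is biased downward, as its simulated mean shows.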

Warning :  Attaining  the  Cramer-Rao  lower  bound  is  only  a  sufficient  condition  for  efficiency. 
Failing  to  satisfy  this  condition  does  not  necessarily  imply  that  the  estimator  is  not  efficient. 

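Example 2: For the Bernoulli case

$$\log f(X_i; \theta) = X_i \log \theta + (1 - X_i)\log(1 - \theta)$$

$$\partial \log f(X_i; \theta)/\partial \theta = (X_i/\theta) - (1 - X_i)/(1 - \theta)$$

$$\partial^2 \log f(X_i; \theta)/\partial \theta^2 = (-X_i/\theta^2) - (1 - X_i)/(1 - \theta)^2$$

and $E[\partial^2 \log f(X_i; \theta)/\partial \theta^2] = (-1/\theta) - 1/(1 - \theta) = -1/[\theta(1 - \theta)]$. Therefore, for any unbiased
estimator of $\theta$, say $\hat{\theta}$, its variance satisfies the following property:

$$\text{var}(\hat{\theta}) \geq \theta(1 - \theta)/n.$$

For the Bernoulli random sample, we proved that $\mu = E(X_i) = \theta$. Similarly, it can be easily
verified that $\sigma^2 = \text{var}(X_i) = \theta(1 - \theta)$. Hence, $\bar{X}$ has mean $\mu = \theta$ and var$(\bar{X}) = \sigma^2/n = \theta(1 - \theta)/n$.
This means that $\bar{X}$ is unbiased for $\theta$ and it attains the Cramer-Rao lower bound. Therefore, $\bar{X}$
is MVU for $\theta$.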
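For the Bernoulli case the mean and variance of $\bar{X}$ can be computed exactly, since $\sum X_i$ follows a Binomial$(n, \theta)$ distribution. The following sketch uses the hypothetical values $\theta = 0.3$ and $n = 20$; the function name `xbar_moments` is illustrative.

```python
from math import comb


def xbar_moments(theta, n):
    """Exact mean and variance of X-bar, using the Binomial(n, theta) law of the sum."""
    pmf = [comb(n, k) * theta ** k * (1 - theta) ** (n - k) for k in range(n + 1)]
    mean = sum(p * k / n for k, p in enumerate(pmf))
    var = sum(p * (k / n - mean) ** 2 for k, p in enumerate(pmf))
    return mean, var


theta, n = 0.3, 20
mean, var = xbar_moments(theta, n)
bound = theta * (1 - theta) / n  # Cramer-Rao lower bound for unbiased estimators
print(mean, var, bound)
```

Up to rounding error, the mean equals $\theta$ and the variance coincides with the bound.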

Unbiasedness and efficiency are finite sample properties (in other words, true for any finite
sample size $n$). Once we let $n$ tend to $\infty$ then we are in the realm of asymptotic properties.

Example 3: For a random sample from any distribution with mean $\mu$ it is clear that
$\tilde{\mu} = (\bar{X} + 1/n)$ is not an unbiased estimator of $\mu$ since $E(\tilde{\mu}) = E(\bar{X} + 1/n) = \mu + 1/n$. However, as
$n \to \infty$ the lim $E(\tilde{\mu})$ is equal to $\mu$. We say that $\tilde{\mu}$ is asymptotically unbiased for $\mu$.

Example 4: For the Normal case

$$\hat{\sigma}^2_{MLE} = (n-1)s^2/n \quad \text{and} \quad E(\hat{\sigma}^2_{MLE}) = (n-1)\sigma^2/n.$$

But as $n \to \infty$, lim $E(\hat{\sigma}^2_{MLE}) = \sigma^2$. Hence, $\hat{\sigma}^2_{MLE}$ is asymptotically unbiased for $\sigma^2$.

Similarly, an estimator which attains the Cramer-Rao lower bound in the limit is asymptotically
efficient. Note that var$(\bar{X}) = \sigma^2/n$, and this tends to zero as $n \to \infty$. Hence, we
consider $\sqrt{n}\bar{X}$ which has finite variance since var$(\sqrt{n}\bar{X}) = n\,$var$(\bar{X}) = \sigma^2$. We say that the
asymptotic variance of $\bar{X}$, denoted by asymp.var$(\bar{X})$, is $\sigma^2/n$ and that it attains the Cramer-Rao
lower bound in the limit. $\bar{X}$ is therefore asymptotically efficient. Similarly,

$$\text{var}(\sqrt{n}\hat{\sigma}^2_{MLE}) = n\,\text{var}(\hat{\sigma}^2_{MLE}) = 2(n-1)\sigma^4/n$$

which tends to $2\sigma^4$ as $n \to \infty$. This means that asymp.var$(\hat{\sigma}^2_{MLE}) = 2\sigma^4/n$ and that it attains
the Cramer-Rao lower bound in the limit. Therefore, $\hat{\sigma}^2_{MLE}$ is asymptotically efficient.
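The limiting behavior can be tabulated directly from the formulas derived for the Normal case. This short sketch, with the hypothetical value $\sigma^2 = 1$, simply evaluates the expected value and the scaled variance of the MLE of $\sigma^2$ for increasing $n$.

```python
def mle_variance_moments(sigma2, n):
    """Return E(sigma2_hat_MLE) and n * var(sigma2_hat_MLE) for a Normal sample of size n."""
    expected = (n - 1) * sigma2 / n
    scaled_var = 2 * (n - 1) * sigma2 ** 2 / n
    return expected, scaled_var


sigma2 = 1.0
for n in (5, 50, 500, 5000):
    expected, scaled_var = mle_variance_moments(sigma2, n)
    # expected approaches sigma^2 and scaled_var approaches 2 * sigma^4 as n grows
    print(n, expected, scaled_var)
```

The bias $-\sigma^2/n$ and the gap between $n\,$var$(\hat{\sigma}^2_{MLE})$ and $2\sigma^4$ both shrink at rate $1/n$.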
(iii) Consistency

Another asymptotic property is consistency. This says that as $n \to \infty$, lim Pr$[|\bar{X} - \mu| > c] = 0$
for any arbitrary positive constant $c$. In other words, $\bar{X}$ will not differ from $\mu$ as $n \to \infty$.
Proving this property uses Chebyshev’s inequality which states in this context that

$$\Pr[|\bar{X} - \mu| > k\sigma_{\bar{X}}] \leq 1/k^2.$$

If we let $c = k\sigma_{\bar{X}}$ then $1/k^2 = \sigma^2_{\bar{X}}/c^2 = \sigma^2/nc^2$ and this tends to 0 as $n \to \infty$, since $\sigma^2$ and $c$
are finite positive constants. A sufficient condition for an estimator to be consistent is that it is
asymptotically unbiased and that its variance tends to zero as $n \to \infty$.²

Example 1: For a random sample from any distribution with mean $\mu$ and variance $\sigma^2$, $E(\bar{X}) = \mu$
and var$(\bar{X}) = \sigma^2/n \to 0$ as $n \to \infty$, hence $\bar{X}$ is consistent for $\mu$.

Example 2: For the Normal case, we have shown that $E(s^2) = \sigma^2$ and var$(s^2) = 2\sigma^4/(n-1) \to 0$
as $n \to \infty$, hence $s^2$ is consistent for $\sigma^2$.

Example 3: For the Bernoulli case, we know that $E(\bar{X}) = \theta$ and var$(\bar{X}) = \theta(1 - \theta)/n \to 0$ as
$n \to \infty$, hence $\bar{X}$ is consistent for $\theta$.

Warning: This is only a sufficient condition for consistency. Failing to satisfy this condition
does not necessarily imply that the estimator is inconsistent.
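The Chebyshev argument can be illustrated by simulating the tail probability Pr$[|\bar{X} - \mu| > c]$ for growing $n$ and comparing it with the bound $\sigma^2/nc^2$. This is a sketch under assumed values ($\mu = 0$, $\sigma = 1$, $c = 0.25$, Normal draws); the function name `tail_frequency` is hypothetical.

```python
import random


def tail_frequency(mu, sigma, n, c, replications, seed=0):
    """Share of simulated samples in which |X-bar - mu| exceeds c."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if abs(xbar - mu) > c:
            hits += 1
    return hits / replications


mu, sigma, c = 0.0, 1.0, 0.25
for n in (10, 40, 160):
    freq = tail_frequency(mu, sigma, n, c, replications=2000)
    bound = sigma ** 2 / (n * c ** 2)  # Chebyshev bound 1/k^2 with c = k * sigma_xbar
    print(n, freq, min(bound, 1.0))    # a bound above 1 carries no information
```

The simulated frequency falls toward zero as $n$ grows and stays below the Chebyshev bound, which is typically far from tight.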


(iv)  Sufficiency 

$\bar{X}$ is sufficient for $\mu$, if $\bar{X}$ contains all the information in the sample pertaining to $\mu$. In other
words, $f(X_1, \ldots, X_n / \bar{X})$ is independent of $\mu$. To prove this fact one uses the factorization
theorem due to Fisher and Neyman. In this context, $\bar{X}$ is sufficient for $\mu$, if and only if one can
factorize the joint p.d.f.

$$f(X_1, \ldots, X_n; \mu) = h(\bar{X}; \mu) \cdot g(X_1, \ldots, X_n)$$

where $h$ and $g$ are any two functions with the latter being only a function of the $X$'s and
independent of $\mu$ in form and in the domain of the $X$'s.

Example 1: For the Normal case, it is clear from equation (2.1) that by subtracting and adding
$\bar{X}$ in the summation we can write after some algebra

$$f(X_1, \ldots, X_n; \mu, \sigma^2) = (1/2\pi\sigma^2)^{n/2}\, e^{-(1/2\sigma^2)\sum_{i=1}^n (X_i - \bar{X})^2}\, e^{-(n/2\sigma^2)(\bar{X} - \mu)^2}$$

Hence, $h(\bar{X}; \mu) = e^{-(n/2\sigma^2)(\bar{X} - \mu)^2}$ and $g(X_1, \ldots, X_n)$ is the remainder term which is independent
of $\mu$ in form. Also $-\infty < X_i < \infty$ and hence independent of $\mu$ in the domain. Therefore, $\bar{X}$ is
sufficient for $\mu$.

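Example 2: For the Bernoulli case,

$$f(X_1, \ldots, X_n; \theta) = \theta^{n\bar{X}} (1 - \theta)^{n(1 - \bar{X})} \qquad X_i = 0, 1 \text{ for } i = 1, \ldots, n.$$

Therefore, $h(\bar{X}; \theta) = \theta^{n\bar{X}} (1 - \theta)^{n(1 - \bar{X})}$ and $g(X_1, \ldots, X_n) = 1$ which is independent of $\theta$ in form
and domain. Hence, $\bar{X}$ is sufficient for $\theta$.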
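
The Bernoulli factorization can be illustrated with a small sketch. The function name `bernoulli_likelihood` and the two samples below are hypothetical; the point is only that two samples with the same $\bar{X}$ give the same joint p.d.f. for every $\theta$, since $g(X_1, \ldots, X_n) = 1$.

```python
def bernoulli_likelihood(sample, theta):
    """Joint p.d.f. of a 0/1 sample: theta^(n*xbar) * (1 - theta)^(n*(1 - xbar))."""
    successes = sum(sample)  # equals n * xbar
    return theta ** successes * (1 - theta) ** (len(sample) - successes)


# Two different (made-up) samples of size 5 with the same mean 0.6.
sample_a = [1, 1, 0, 0, 1]
sample_b = [0, 1, 1, 1, 0]

for theta in (0.2, 0.5, 0.9):
    # The likelihood depends on the data only through the sample mean.
    print(theta, bernoulli_likelihood(sample_a, theta), bernoulli_likelihood(sample_b, theta))
```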
Under certain regularity conditions on the distributions we are sampling from, one can show
that the MVU of any parameter $\theta$ is an unbiased function of a sufficient statistic for $\theta$.² Advantages
of the maximum likelihood estimators are that (i) they are sufficient estimators when they
exist, (ii) they are asymptotically efficient, (iii) if the distribution of the MLE satisfies certain
regularity conditions, then making the MLE unbiased results in a unique MVU estimator. A
prime example of this is $s^2$ which was shown to be an unbiased estimator of $\sigma^2$ for a random
sample drawn from the Normal distribution. It can be shown that $s^2$ is sufficient for $\sigma^2$ and that
$(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$. Hence, $s^2$ is an unbiased sufficient statistic for $\sigma^2$ and therefore it is MVU
for $\sigma^2$, even though it does not attain the Cramer-Rao lower bound, (iv) maximum likelihood
estimates are invariant with respect to continuous transformations. To explain the last property,
consider the estimator of $e^{\mu}$. Given $\hat{\mu}_{MLE} = \bar{X}$, an obvious estimator is $e^{\hat{\mu}_{MLE}} = e^{\bar{X}}$. This is in
fact the MLE of $e^{\mu}$. In general, if $g(\mu)$ is a continuous function of $\mu$, then $g(\hat{\mu}_{MLE})$ is the MLE of
$g(\mu)$. Note that $E(e^{\hat{\mu}_{MLE}}) \neq e^{E(\hat{\mu}_{MLE})} = e^{\mu}$; in other words, expectations are not invariant to all
continuous transformations, especially nonlinear ones, and hence the resulting MLE estimator
may not be unbiased. $e^{\bar{X}}$ is not unbiased for $e^{\mu}$ even though $\bar{X}$ is unbiased for $\mu$.
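
To see the size of this bias in the Normal case, one can use the standard lognormal mean: if $\bar{X} \sim N(\mu, \sigma^2/n)$ then $E(e^{\bar{X}}) = e^{\mu + \sigma^2/2n}$. This formula is not derived in the text and is brought in here as an assumption of the sketch; the function names are illustrative.

```python
import math


def expected_exp_of_mean(mu, sigma2, n):
    """E[exp(Xbar)] for Xbar ~ N(mu, sigma2/n), using the lognormal mean formula."""
    return math.exp(mu + sigma2 / (2 * n))


def bias_of_exp_mean(mu, sigma2, n):
    """Bias of exp(Xbar) as an estimator of exp(mu)."""
    return expected_exp_of_mean(mu, sigma2, n) - math.exp(mu)


for n in (5, 50, 500):
    # The bias is positive for every finite n and vanishes as n grows.
    print(n, bias_of_exp_mean(mu=1.0, sigma2=4.0, n=n))
```

Under these assumptions $e^{\bar{X}}$ overestimates $e^{\mu}$ on average, but the bias disappears asymptotically.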

In  summary,  there  are  two  routes  for  finding  the  MVU  estimator.  One  is  systematically 
following  the  derivation  of  a  sufficient  statistic,  proving  that  its  distribution  satisfies  certain 
regularity  conditions,  and  then  making  it  unbiased  for  the  parameter  in  question.  Of  course, 
MLE  provides  us  with  sufficient  statistics,  for  example, 

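$$X_1, \ldots, X_n \sim \text{IIN}(\mu, \sigma^2) \Rightarrow \hat{\mu}_{MLE} = \bar{X} \text{ and } \hat{\sigma}^2_{MLE} = \sum_{i=1}^n (X_i - \bar{X})^2/n$$

are both sufficient for $\mu$ and $\sigma^2$, respectively. $\bar{X}$ is unbiased for $\mu$ and $\bar{X} \sim N(\mu, \sigma^2/n)$. The
Normal distribution satisfies the regularity conditions needed for $\bar{X}$ to be MVU for $\mu$. $\hat{\sigma}^2_{MLE}$ is
biased for $\sigma^2$, but $s^2 = n\hat{\sigma}^2_{MLE}/(n-1)$ is unbiased for $\sigma^2$ and $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$ which also
satisfies the regularity conditions for $s^2$ to be a MVU estimator for $\sigma^2$.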

Alternatively,  one  finds  the  Cramer-Rao  lower  bound  and  checks  whether  the  usual  estimator 
(obtained  from  say  the  method  of  moments  or  the  maximum  likelihood  method)  achieves  this 
lower  bound.  If  it  does,  this  estimator  is  efficient,  and  there  is  no  need  to  search  further.  If  it 
does  not,  the  former  strategy  leads  us  to  the  MVU  estimator.  In  fact,  in  the  previous  example 
$\bar{X}$ attains the Cramer-Rao lower bound, whereas $s^2$ does not. However, both are MVU for $\mu$
and $\sigma^2$ respectively.


(v)  Comparing  Biased  and  Unbiased  Estimators 

Suppose we are given two estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ of $\theta$ where the first is unbiased and has a large
variance and the second is biased but with a small variance. The question is which one of these
two estimators is preferable? $\hat{\theta}_1$ is unbiased whereas $\hat{\theta}_2$ is biased. This means that if we repeat
the sampling procedure many times then we expect $\hat{\theta}_1$ to be on the average correct, whereas
$\hat{\theta}_2$ would be on the average different from $\theta$. However, in real life, we observe only one sample.
With a large variance for $\hat{\theta}_1$, there is a great likelihood that the sample drawn could result in
a $\hat{\theta}_1$ far away from $\theta$. However, with a small variance for $\hat{\theta}_2$, there is a better chance of getting
a $\hat{\theta}_2$ close to $\theta$. If our loss function is $L(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$ then our risk is

$$R(\hat{\theta}, \theta) = E[L(\hat{\theta}, \theta)] = E(\hat{\theta} - \theta)^2 = MSE(\hat{\theta})$$

$$= E[\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta]^2 = \text{var}(\hat{\theta}) + (\text{Bias}(\hat{\theta}))^2.$$
Minimizing the risk when the loss function is quadratic is equivalent to minimizing the Mean
Square Error (MSE). From its definition the MSE shows the trade-off between bias and variance.
MVU theory sets the bias equal to zero and minimizes var($\hat{\theta}$). In other words, it minimizes the
above risk function but only over $\hat{\theta}$'s that are unbiased. If we do not restrict ourselves to
unbiased estimators of $\theta$, minimizing MSE may result in a biased estimator such as $\hat{\theta}_2$ which
beats $\hat{\theta}_1$ because the gain from its smaller variance outweighs the loss from its small bias, see
Figure 2.2.

Figure 2.2 Bias Versus Variance
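
A concrete instance of this trade-off is the pair $s^2$ and $\hat{\sigma}^2_{MLE}$ for a Normal sample. The sketch below uses var($s^2$) $= 2\sigma^4/(n-1)$ and var($\hat{\sigma}^2_{MLE}$) $= 2(n-1)\sigma^4/n^2$ from earlier in the chapter, and assumes the standard result $E(\hat{\sigma}^2_{MLE}) = (n-1)\sigma^2/n$, so that the bias is $-\sigma^2/n$. The function names are illustrative.

```python
def mse_s2(sigma2, n):
    """MSE of the unbiased estimator s^2: variance only, 2*sigma^4/(n - 1)."""
    return 2 * sigma2 ** 2 / (n - 1)


def mse_sigma2_mle(sigma2, n):
    """MSE of the biased MLE of sigma^2: variance plus squared bias."""
    variance = 2 * (n - 1) * sigma2 ** 2 / n ** 2
    bias = -sigma2 / n  # assumed: E(sigma2_mle) = (n - 1) * sigma2 / n
    return variance + bias ** 2


for n in (5, 10, 100):
    # The biased MLE has the smaller MSE, with the gap closing as n grows.
    print(n, mse_s2(4.0, n), mse_sigma2_mle(4.0, n))
```

Under these assumptions the biased estimator $\hat{\sigma}^2_{MLE}$ has the smaller MSE for every $n$, even though $s^2$ is MVU among unbiased estimators.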


2.4  Hypothesis  Testing 

The  best  way  to  proceed  is  with  an  example. 

Example 1: The Economics Department instituted a new program to teach micro-principles.
We would like to test the null hypothesis that 80% of economics undergraduate students will
pass the micro-principles course versus the alternative hypothesis that only 50% will pass. We
draw a random sample of size 20 from the large undergraduate micro-principles class and as
a simple rule we accept the null if $x$, the number of passing students, is greater than or equal to 13;
otherwise the alternative hypothesis will be accepted. Note that the distribution we are drawing
from is Bernoulli with the probability of success $\theta$, and we have chosen only two states of the
world $H_0$: $\theta_0 = 0.80$ and $H_1$: $\theta_1 = 0.5$. This situation is known as testing a simple hypothesis
versus another simple hypothesis because the distribution is completely specified under the null
or alternative hypothesis. One would expect ($E(x) = n\theta_0$) 16 students under $H_0$ and ($n\theta_1$) 10
students under $H_1$ to pass the micro-principles exams. It seems then logical to take $x \ge 13$ as
the cut-off point distinguishing $H_0$ from $H_1$. No theoretical justification is given at this stage
to this arbitrary choice except to say that it is the mid-point of [10, 16]. Figure 2.3 shows that
one can make two types of errors. The first is rejecting $H_0$ when in fact it is true; this is known
as type I error and the probability of committing this error is denoted by $\alpha$. The second is
accepting $H_0$ when it is false. This is known as type II error and the corresponding probability
is denoted by $\beta$. For this example
$$\alpha = \Pr[\text{rejecting } H_0 / H_0 \text{ is true}] = \Pr[x < 13 / \theta = 0.8]$$

$$= b(n = 20; x = 0; \theta = 0.8) + \ldots + b(n = 20; x = 12; \theta = 0.8)$$

$$= b(n = 20; x = 20; \theta = 0.2) + \ldots + b(n = 20; x = 8; \theta = 0.2)$$

$$= 0 + \ldots + 0 + 0.0001 + 0.0005 + 0.0020 + 0.0074 + 0.0222 = 0.0322$$

where we have used the fact that $b(n; x; \theta) = b(n; n - x; 1 - \theta)$ and $b(n; x; \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n-x}$
denotes the binomial distribution for $x = 0, 1, \ldots, n$, see problem 4.


                        True World
Decision        $\theta_0 = 0.80$       $\theta_1 = 0.50$
$\theta_0$      No error                Type II error
$\theta_1$      Type I error            No error

Figure 2.3 Type I and II Error


$$\beta = \Pr[\text{accepting } H_0 / H_0 \text{ is false}] = \Pr[x \ge 13 / \theta = 0.5]$$

$$= b(n = 20; x = 13; \theta = 0.5) + \ldots + b(n = 20; x = 20; \theta = 0.5)$$

$$= 0.0739 + 0.0370 + 0.0148 + 0.0046 + 0.0011 + 0.0002 + 0 + 0 = 0.1316$$

The rejection region for $H_0$, $x < 13$, is known as the critical region of the test and $\alpha$ = Pr[Falling
in the critical region / $H_0$ is true] is also known as the size of the critical region. A good test
is one which minimizes both types of errors $\alpha$ and $\beta$. For the above example, $\alpha$ is low but $\beta$
is high with more than a 13% chance of happening. This $\beta$ can be reduced by changing the
critical region from $x < 13$ to $x < 14$, so that $H_0$ is accepted only if $x \ge 14$. In this case, one
can easily verify that

$$\alpha = \Pr[x < 14 / \theta = 0.8] = b(n = 20; x = 0; \theta = 0.8) + \ldots + b(n = 20; x = 13; \theta = 0.8)$$

$$= 0.0322 + b(n = 20; x = 13; \theta = 0.8) = 0.0322 + 0.0545 = 0.0867$$

and

$$\beta = \Pr[x \ge 14 / \theta = 0.5] = b(n = 20; x = 14; \theta = 0.5) + \ldots + b(n = 20; x = 20; \theta = 0.5)$$

$$= 0.1316 - b(n = 20; x = 13; \theta = 0.5) = 0.0577$$
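
These binomial sums are easy to reproduce. The following is a minimal sketch; `binom_pmf` and `error_probabilities` are hypothetical helper names, and the rule coded is the one used above (accept $H_0$ when $x$ is at least the cut-off). The exact sums can differ from the figures in the text in the fourth decimal place because the text adds terms that were already rounded.

```python
from math import comb


def binom_pmf(n, x, theta):
    """b(n; x; theta) = C(n, x) * theta^x * (1 - theta)^(n - x)."""
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)


def error_probabilities(cutoff, n=20, theta0=0.8, theta1=0.5):
    """Return (alpha, beta) when H0 is accepted if x >= cutoff."""
    # alpha: probability of falling below the cut-off when theta = theta0
    alpha = sum(binom_pmf(n, x, theta0) for x in range(0, cutoff))
    # beta: probability of reaching the cut-off when theta = theta1
    beta = sum(binom_pmf(n, x, theta1) for x in range(cutoff, n + 1))
    return alpha, beta


for cutoff in (13, 14):
    alpha, beta = error_probabilities(cutoff)
    print(cutoff, round(alpha, 4), round(beta, 4))
```

Raising the cut-off from 13 to 14 trades a larger $\alpha$ for a smaller $\beta$, as described in the text.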

By becoming more conservative on accepting $H_0$ and more liberal on accepting $H_1$, one reduces
$\beta$ from 0.1316 to 0.0577 but the price paid is the increase in $\alpha$ from 0.0322 to 0.0867. The only
way to reduce both $\alpha$ and $\beta$ is by increasing $n$. For a fixed $n$, there is a tradeoff between $\alpha$ and
$\beta$ as we change the critical region. To understand this clearly, consider the real life situation of
trial by jury for which the defendant can be innocent or guilty. The decision of incarceration
or release implies two types of errors. One can make $\alpha$ = Pr[incarcerating/innocence] = 0 and
$\beta$ = its maximum, by releasing every defendant. Or one can make $\beta$ = Pr[release/guilty] = 0
and $\alpha$ = its maximum, by incarcerating every defendant. These are extreme cases but hopefully
they demonstrate the trade-off between $\alpha$ and $\beta$.
The Neyman-Pearson Theory

The classical theory of hypothesis testing, known as the Neyman-Pearson theory, fixes $\alpha$ =
Pr(type I error) $\le$ a constant and minimizes $\beta$ or maximizes $(1 - \beta)$. The latter is known as
the Power of the test under the alternative.

The Neyman-Pearson Lemma: If $C$ is a critical region of size $\alpha$ and $k$ is a constant such that

$$(L_0/L_1) \le k \text{ inside } C$$

and

$$(L_0/L_1) \ge k \text{ outside } C$$

then $C$ is a most powerful critical region of size $\alpha$ for testing $H_0$: $\theta = \theta_0$, against $H_1$: $\theta = \theta_1$.

Note that the likelihood has to be completely specified under the null and alternative. Hence,
this lemma applies only to testing a simple versus another simple hypothesis. The proof of this
lemma is given in Freund (1992). Intuitively, $L_0$ is the likelihood function under the null $H_0$
and $L_1$ is the corresponding likelihood function under $H_1$. Therefore, $(L_0/L_1)$ should be small
for points inside the critical region $C$ and large for points outside the critical region $C$. The
proof of the theorem shows that any other critical region, say $D$, of size $\alpha$ cannot have a smaller
probability of type II error than $C$. Therefore, $C$ is the best or most powerful critical region of
size $\alpha$. Its power $(1 - \beta)$ is maximum at $H_1$. Let us demonstrate this lemma with an example.

Example 2: Given a random sample of size $n$ from $N(\mu, \sigma^2 = 4)$, use the Neyman-Pearson
lemma to find the most powerful critical region of size $\alpha = 0.05$ for testing $H_0$: $\mu_0 = 2$ against
the alternative $H_1$: $\mu_1 = 4$.

Note that this is a simple versus simple hypothesis as required by the lemma, since $\sigma^2 = 4$
is known and $\mu$ is specified by $H_0$ and $H_1$. The likelihood function for the $N(\mu, 4)$ density is
given by

$$L(\mu) = f(x_1, \ldots, x_n; \mu, 4) = (1/2\sqrt{2\pi})^n \exp\{-\sum_{i=1}^n (x_i - \mu)^2/8\}$$

so that

$$L_0 = L(\mu_0) = (1/2\sqrt{2\pi})^n \exp\{-\sum_{i=1}^n (x_i - 2)^2/8\}$$

and

$$L_1 = L(\mu_1) = (1/2\sqrt{2\pi})^n \exp\{-\sum_{i=1}^n (x_i - 4)^2/8\}$$

Therefore

$$L_0/L_1 = \exp\{-[\sum_{i=1}^n (x_i - 2)^2 - \sum_{i=1}^n (x_i - 4)^2]/8\} = \exp\{-\sum_{i=1}^n x_i/2 + 3n/2\}$$

and the critical region is defined by

$$\exp\{-\sum_{i=1}^n x_i/2 + 3n/2\} \le k \text{ inside } C$$

Taking logarithms of both sides, subtracting $(3/2)n$ and dividing by $(-1/2)n$ one gets

$$\bar{x} \ge K \text{ inside } C$$
In practice, one need not keep track of $K$ as long as one keeps track of the direction of the
inequality. $K$ can be determined by making the size of $C = \alpha = 0.05$. In this case

$$\alpha = \Pr[\bar{x} \ge K / \mu = 2] = \Pr[z \ge (K - 2)/(2/\sqrt{n})]$$

where $z = (\bar{x} - 2)/(2/\sqrt{n})$ is distributed $N(0, 1)$ under $H_0$. From the $N(0, 1)$ tables, we have

$$\frac{K - 2}{(2/\sqrt{n})} = 1.645$$

Hence,

$$K = 2 + 1.645(2/\sqrt{n})$$

and $\bar{x} \ge 2 + 1.645(2/\sqrt{n})$ defines the most powerful critical region of size $\alpha = 0.05$ for testing
$H_0$: $\mu_0 = 2$ versus $H_1$: $\mu_1 = 4$. Note that, in this case

$$\beta = \Pr[\bar{x} < 2 + 1.645(2/\sqrt{n}) / \mu = 4]$$

$$= \Pr[z < [-2 + 1.645(2/\sqrt{n})]/(2/\sqrt{n})] = \Pr[z < 1.645 - \sqrt{n}]$$

For $n = 4$; $\beta = \Pr[z < -0.355] = 0.3613$ shown by the shaded region in Figure 2.4. For $n = 9$;
$\beta = \Pr[z < -1.355] = 0.0877$, and for $n = 16$; $\beta = \Pr[z < -2.355] = 0.00925$.


Figure 2.4 Critical Region for Testing $\mu_0 = 2$ against $\mu_1 = 4$ for $n = 4$

This gives us an idea of how, for a fixed $\alpha = 0.05$, the minimum $\beta$ decreases with larger sample
size $n$. As $n$ increases from 4 to 9 to 16, the var($\bar{x}$) $= \sigma^2/n$ decreases and the two distributions
shown in Figure 2.4 shrink in dispersion still centered around $\mu_0 = 2$ and $\mu_1 = 4$, respectively.
This allows better decision making (based on larger sample size) as reflected by the critical
region shrinking from $\bar{x} \ge 3.65$ for $n = 4$ to $\bar{x} \ge 2.8225$ for $n = 16$, and the power $(1 - \beta)$ rising
from 0.6387 to 0.9908, respectively, for a fixed $\alpha \le 0.05$. The power function is the probability
of rejecting $H_0$. It is equal to $\alpha$ under $H_0$ and $1 - \beta$ under $H_1$. The ideal power function is zero
at $H_0$ and one at $H_1$. The Neyman-Pearson lemma allows us to fix $\alpha$, say at 0.05, and find the
test with the best power at $H_1$.
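
The cut-off $K$ and the resulting $\beta$ can be computed directly from the Normal distribution. The sketch below is illustrative; `most_powerful_test` is a hypothetical name, and it uses the exact 5% upper-tail Normal critical value (about 1.6449) rather than the tabulated 1.645, so results agree with the text only up to rounding.

```python
from math import sqrt
from statistics import NormalDist


def most_powerful_test(n, mu0=2.0, mu1=4.0, sigma=2.0, alpha=0.05):
    """Return (K, beta) for the critical region xbar >= K of size alpha."""
    z = NormalDist()                           # standard Normal
    se = sigma / sqrt(n)                       # standard deviation of xbar
    cutoff = mu0 + z.inv_cdf(1 - alpha) * se   # K = mu0 + z_alpha * sigma / sqrt(n)
    beta = z.cdf((cutoff - mu1) / se)          # Pr[xbar < K / mu = mu1]
    return cutoff, beta


for n in (4, 9, 16):
    cutoff, beta = most_powerful_test(n)
    print(n, round(cutoff, 4), round(beta, 4), round(1 - beta, 4))
```

For a fixed $\alpha$, both the cut-off and $\beta$ fall as $n$ rises, so the power increases.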

In example 2, both the null and alternative hypotheses are simple. In real life, one is more
likely to be faced with testing $H_0$: $\mu = 2$ versus $H_1$: $\mu \neq 2$. Under the alternative hypothesis,
the distribution is not completely specified, since the mean $\mu$ is not known, and this is referred
to as a composite hypothesis. In this case, one cannot compute the probability of type II error
since the distribution is not known under the alternative. Also, the Neyman-Pearson lemma
cannot be applied. However, a simple generalization allows us to compute a Likelihood Ratio
test which has satisfactory properties but is no longer uniformly most powerful of size $\alpha$. In this
case, one replaces $L_1$, which is not known since $H_1$ is a composite hypothesis, by the maximum
value of the likelihood, i.e.,

$$\lambda = \frac{\max L_0}{\max L}$$

Since $\max L_0$ is the maximum value of the likelihood under the null while $\max L$ is the maximum
value of the likelihood over the whole parameter space, it follows that $\max L_0 \le \max L$ and $\lambda \le 1$.
Hence, if $H_0$ is true, $\lambda$ is close to 1, otherwise it is smaller than 1. Therefore, $\lambda \le k$ defines the
critical region for the Likelihood Ratio test, and $k$ is determined such that the size of this test
is $\alpha$.

Example 3: For a random sample $x_1, \ldots, x_n$ drawn from a Normal distribution with mean $\mu$
and variance $\sigma^2 = 4$, derive the Likelihood Ratio test for $H_0$: $\mu = 2$ versus $H_1$: $\mu \neq 2$. In this
case

$$\max L_0 = (1/2\sqrt{2\pi})^n \exp\{-\sum_{i=1}^n (x_i - 2)^2/8\} = L_0$$

and

$$\max L = (1/2\sqrt{2\pi})^n \exp\{-\sum_{i=1}^n (x_i - \bar{x})^2/8\} = L(\hat{\mu}_{MLE})$$

where use is made of the fact that $\hat{\mu}_{MLE} = \bar{x}$. Therefore,

$$\lambda = \exp\{[-\sum_{i=1}^n (x_i - 2)^2 + \sum_{i=1}^n (x_i - \bar{x})^2]/8\} = \exp\{-n(\bar{x} - 2)^2/8\}$$

Hence, the region for which $\lambda \le k$, is equivalent after some simple algebra to the following
region

$$(\bar{x} - 2)^2 \ge K \text{ or } |\bar{x} - 2| \ge K^{1/2}$$

where $K$ is determined such that

$$\Pr[|\bar{x} - 2| \ge K^{1/2} / \mu = 2] = \alpha$$

We know that $\bar{x} \sim N(2, 4/n)$ under $H_0$. Hence, $z = (\bar{x} - 2)/(2/\sqrt{n})$ is $N(0, 1)$ under $H_0$, and
the critical region of size $\alpha$ will be based upon $|z| \ge z_{\alpha/2}$ where $z_{\alpha/2}$ is given in Figure 2.5 and
is the value of a $N(0, 1)$ random variable such that the probability of exceeding it is $\alpha/2$. For
$\alpha = 0.05$, $z_{\alpha/2} = 1.96$, and for $\alpha = 0.10$, $z_{\alpha/2} = 1.645$. This is a two-tailed test with rejection of
$H_0$ obtained in case $z \le -z_{\alpha/2}$ or $z \ge z_{\alpha/2}$.

Note that in this case

$$LR = -2\log\lambda = (\bar{x} - 2)^2/(4/n) = z^2$$

which is distributed as $\chi^2_1$ under $H_0$. This is because it is the square of a $N(0, 1)$ random variable
under $H_0$. This is a finite sample result holding for any $n$. In general, other examples may lead
to more complicated $\lambda$ statistics for which it is difficult to find the corresponding distributions
and hence the corresponding critical values. For these cases, we have an asymptotic result
which states that, for large $n$, $LR = -2\log\lambda$ will be asymptotically distributed as $\chi^2_{\nu}$ where $\nu$
denotes the number of restrictions that are tested by $H_0$. For example 3, $\nu = 1$ and hence, $LR$
is asymptotically distributed as $\chi^2_1$. Note that we did not need this result as we found $LR$ is
exactly distributed as $\chi^2_1$ for any $n$. If one is testing $H_0$: $\mu = 2$ and $\sigma^2 = 4$ against the alternative
that $H_1$: $\mu \neq 2$ or $\sigma^2 \neq 4$, then the corresponding $LR$ will be asymptotically distributed as $\chi^2_2$,
see problem 5, part (f).

Figure 2.5 Critical Values


Likelihood  Ratio,  Wald  and  Lagrange  Multiplier  Tests 

Before  we  go  into  the  derivations  of  these  three  tests  we  start  by  giving  an  intuitive  graphical 
explanation  that  will  hopefully  emphasize  the  differences  among  these  tests.  This  intuitive 
explanation  is  based  on  the  article  by  Buse  (1982). 

Consider a quadratic log-likelihood function in a parameter of interest, say $\mu$. Figure 2.6
shows this log-likelihood $\log L(\mu)$, with a maximum at $\hat{\mu}$. The Likelihood Ratio test tests the
null hypothesis $H_0$: $\mu = \mu_0$ by looking at the ratio of the likelihoods $\lambda = L(\mu_0)/L(\hat{\mu})$ where
$-2\log\lambda$, twice the difference in log-likelihood, is distributed asymptotically as $\chi^2_1$ under $H_0$. This
test differentiates between the top of the hill and a preassigned point on the hill by evaluating
the height at both points. Therefore, it needs both the restricted and unrestricted maximum
of the likelihood. This ratio is dependent on the distance of $\mu_0$ from $\hat{\mu}$ and the curvature of
the log-likelihood, $C(\mu) = |\partial^2 \log L(\mu)/\partial\mu^2|$, at $\hat{\mu}$. In fact, for a fixed $(\hat{\mu} - \mu_0)$, the larger $C(\hat{\mu})$,
the larger is the difference between the two heights. Also, for a given curvature at $\hat{\mu}$, the larger
$(\hat{\mu} - \mu_0)$ the larger is the difference between the heights. The Wald test works from the top of
the hill, i.e., it needs only the unrestricted maximum likelihood. It tries to establish the distance
to $\mu_0$ by looking at the horizontal distance $(\hat{\mu} - \mu_0)$, and the curvature at $\hat{\mu}$. In fact the Wald
statistic is $W = (\hat{\mu} - \mu_0)^2 C(\hat{\mu})$ and this is asymptotically distributed as $\chi^2_1$ under $H_0$. The usual
form of $W$ has $I(\mu) = -E[\partial^2 \log L(\mu)/\partial\mu^2]$, the information matrix evaluated at $\hat{\mu}$, rather than
$C(\hat{\mu})$, but the latter is a consistent estimator of $I(\mu)$. The information matrix will be studied
in detail in Chapter 7. It will be shown, under fairly general conditions, that $\hat{\mu}$, the MLE of
$\mu$, has var($\hat{\mu}$) $= I^{-1}(\mu)$. Hence $W = (\hat{\mu} - \mu_0)^2/\text{var}(\hat{\mu})$, all evaluated at the unrestricted MLE.

Figure 2.6 Wald Test

The Lagrange-Multiplier test (LM), on the other hand, goes to the preassigned point $\mu_0$, i.e.,
it only needs the restricted maximum likelihood, and tries to determine how far it is from the
top of the hill by considering the slope of the tangent to the likelihood $S(\mu) = \partial \log L(\mu)/\partial\mu$ at
$\mu_0$, and the rate at which this slope is changing, i.e., the curvature at $\mu_0$. As Figure 2.7 shows,
for two log-likelihoods with the same $S(\mu_0)$, the one that is closer to the top of the hill is the
one with the larger curvature at $\mu_0$.

This suggests the following statistic: $LM = S^2(\mu_0)\{C(\mu_0)\}^{-1}$ where the curvature appears in
inverse form. In the Appendix to this chapter, we show that $E[S(\mu)] = 0$ and var$[S(\mu)] = I(\mu)$.
Hence $LM = S^2(\mu_0) I^{-1}(\mu_0) = S^2(\mu_0)/\text{var}[S(\mu_0)]$ evaluated at the restricted MLE.
Another interpretation of the LM test is that it is a measure of failure of the restricted estimator,
in this case $\mu_0$, to satisfy the first-order conditions of maximization of the unrestricted likelihood.
We know that $S(\hat{\mu}) = 0$. The question is: to what extent does $S(\mu_0)$ differ from zero? $S(\mu)$ is
known in the statistics literature as the score, and the LM test is also referred to as the score
test. For a more formal treatment of these tests, let us reconsider example 3 of a random sample
$x_1, \ldots, x_n$ from a $N(\mu, 4)$ where we are interested in testing $H_0$: $\mu = 2$ versus $H_1$: $\mu \neq 2$. The
likelihood function $L(\mu)$ as well as $LR = -2\log\lambda = n(\bar{x} - 2)^2/4$ were given in example 3. In
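fact, the score function is given by

$$S(\mu) = \frac{\partial \log L(\mu)}{\partial\mu} = \frac{\sum_{i=1}^n (x_i - \mu)}{4} = \frac{n(\bar{x} - \mu)}{4}$$

and under $H_0$

$$S(\mu_0) = S(2) = \frac{n(\bar{x} - 2)}{4}$$

$$C(\mu) = \left|\frac{\partial^2 \log L(\mu)}{\partial\mu^2}\right| = \left|-\frac{n}{4}\right| = \frac{n}{4} \quad \text{and} \quad I(\mu) = -E\left[\frac{\partial^2 \log L(\mu)}{\partial\mu^2}\right] = \frac{n}{4} = C(\mu).$$

The Wald statistic is based on

$$W = (\hat{\mu}_{MLE} - 2)^2 I(\hat{\mu}_{MLE}) = (\bar{x} - 2)^2 \cdot \left(\frac{n}{4}\right)$$

The LM statistic is based on

$$LM = S^2(\mu_0) I^{-1}(\mu_0) = \frac{n^2(\bar{x} - 2)^2}{16} \cdot \frac{4}{n} = \frac{n(\bar{x} - 2)^2}{4}$$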

Therefore, $W = LM = LR$ for this example with known variance $\sigma^2 = 4$. These tests are all
based upon the $|\bar{x} - 2| \ge k$ critical region, where $k$ is determined such that the size of the test
is $\alpha$. In general, these test statistics are not always equal, as is shown in the next example.
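
The equality of the three statistics in the known-variance case can be verified numerically. This is a minimal sketch with a made-up sample of four observations; the function name `known_variance_tests` is hypothetical, and the constant term of the log-likelihood is dropped because it cancels in LR.

```python
def known_variance_tests(sample, mu0=2.0, sigma2=4.0):
    """Return (W, LM, LR) for H0: mu = mu0 in a N(mu, sigma2) sample, sigma2 known."""
    n = len(sample)
    xbar = sum(sample) / n
    info = n / sigma2                    # I(mu) = C(mu) = n / sigma^2
    score0 = n * (xbar - mu0) / sigma2   # S(mu0)

    def loglik(mu):
        # log-likelihood without the constant term
        return -sum((x - mu) ** 2 for x in sample) / (2 * sigma2)

    wald = (xbar - mu0) ** 2 * info
    lm = score0 ** 2 / info
    lr = 2 * (loglik(xbar) - loglik(mu0))  # -2 log(lambda)
    return wald, lm, lr


# Illustrative data only: n = 4 and xbar = 3.5, so each statistic is 4 * 1.5**2 / 4 = 2.25.
print(known_variance_tests([1.0, 3.0, 4.0, 6.0]))
```

With these numbers all three statistics equal 2.25, below $1.96^2 = 3.8416$, so $H_0$ would not be rejected at the 5% level.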

Example 4: For a random sample $x_1, \ldots, x_n$ drawn from a $N(\mu, \sigma^2)$ with unknown $\sigma^2$, test the
hypothesis $H_0$: $\mu = 2$ versus $H_1$: $\mu \neq 2$. Problem 5, part (c), asks the reader to verify that

$$LR = n\log\left[\frac{\sum_{i=1}^n (x_i - 2)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right] \quad \text{whereas} \quad W = \frac{n^2(\bar{x} - 2)^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \quad \text{and} \quad LM = \frac{n^2(\bar{x} - 2)^2}{\sum_{i=1}^n (x_i - 2)^2}.$$

One can easily show that $LM/n = (W/n)/[1 + (W/n)]$ and $LR/n = \log[1 + (W/n)]$. Let $y = W/n$,
then using the inequality $y \ge \log(1 + y) \ge y/(1 + y)$, one can conclude that $W \ge LR \ge LM$.
This inequality was derived by Berndt and Savin (1977), and will be considered again when we
study test of hypotheses in the general linear model. Note, however, that all three test statistics
are based upon $|\bar{x} - 2| \ge k$ and for finite $n$, the same exact critical value could be obtained
from the Normally distributed $\bar{x}$. This section introduced the W, LR and LM test statistics, all
of which have the same asymptotic distribution. In addition, we showed that using the normal
distribution, when $\sigma^2$ is known, $W = LR = LM$ for testing $H_0$: $\mu = 2$ versus $H_1$: $\mu \neq 2$.
However, when $\sigma^2$ is unknown, we showed that $W \ge LR \ge LM$ for the same hypothesis.
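
The ordering $W \ge LR \ge LM$ and the two identities linking the statistics can be checked on any sample. The sketch below uses a made-up sample; `unknown_variance_tests` is a hypothetical name, and the formulas are the ones stated in example 4.

```python
from math import log


def unknown_variance_tests(sample, mu0=2.0):
    """Return (W, LR, LM) for H0: mu = mu0 in a Normal sample with unknown variance."""
    n = len(sample)
    xbar = sum(sample) / n
    ss_unrestricted = sum((x - xbar) ** 2 for x in sample)  # sum of (x_i - xbar)^2
    ss_restricted = sum((x - mu0) ** 2 for x in sample)     # sum of (x_i - mu0)^2
    wald = n ** 2 * (xbar - mu0) ** 2 / ss_unrestricted
    lr = n * log(ss_restricted / ss_unrestricted)
    lm = n ** 2 * (xbar - mu0) ** 2 / ss_restricted
    return wald, lr, lm


sample = [1.0, 3.0, 4.0, 6.0]  # illustrative data only
wald, lr, lm = unknown_variance_tests(sample)
y = wald / len(sample)
print(wald, lr, lm)                              # W >= LR >= LM
print(lm / len(sample), y / (1 + y))             # LM/n = (W/n) / [1 + (W/n)]
print(lr / len(sample), log(1 + y))              # LR/n = log[1 + (W/n)]
```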

Example 5: For a random sample $x_1, \ldots, x_n$ drawn from a Bernoulli distribution with parameter
$\theta$, test the hypothesis $H_0$: $\theta = \theta_0$ versus $H_1$: $\theta \neq \theta_0$, where $\theta_0$ is a known positive fraction. This
example is based on Engle (1984). Problem 4, part (i), asks the reader to derive LR, W and
LM for $H_0$: $\theta = 0.2$ versus $H_1$: $\theta \neq 0.2$. The likelihood $L(\theta)$ and the score $S(\theta)$ were derived in
section 2.2. One can easily verify that


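$$C(\theta) = \left|\frac{\partial^2 \log L(\theta)}{\partial\theta^2}\right| = \frac{\sum_{i=1}^n x_i}{\theta^2} + \frac{n - \sum_{i=1}^n x_i}{(1 - \theta)^2}$$

and

$$I(\theta) = -E\left[\frac{\partial^2 \log L(\theta)}{\partial\theta^2}\right] = \frac{n}{\theta(1 - \theta)}$$

The Wald statistic is based on

$$W = (\hat{\theta}_{MLE} - \theta_0)^2 I(\hat{\theta}_{MLE}) = (\bar{x} - \theta_0)^2 \cdot \frac{n}{\bar{x}(1 - \bar{x})} = \frac{(\bar{x} - \theta_0)^2}{\bar{x}(1 - \bar{x})/n}$$

using the fact that $\hat{\theta}_{MLE} = \bar{x}$. The LM statistic is based on

$$LM = S^2(\theta_0) I^{-1}(\theta_0) = \frac{(\bar{x} - \theta_0)^2}{[\theta_0(1 - \theta_0)/n]^2} \cdot \frac{\theta_0(1 - \theta_0)}{n} = \frac{(\bar{x} - \theta_0)^2}{\theta_0(1 - \theta_0)/n}$$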

Note that the numerators of W and LM are the same. It is the denominator, which is
var($\bar{x}$) $= \theta(1 - \theta)/n$, that is different. For Wald, this var($\bar{x}$) is evaluated at $\hat{\theta}_{MLE}$, whereas for
LM, this is evaluated at $\theta_0$.

The LR statistic is based on

$$\log L(\hat{\theta}_{MLE}) = \sum_{i=1}^n x_i \log\bar{x} + (n - \sum_{i=1}^n x_i)\log(1 - \bar{x})$$

and

$$\log L(\theta_0) = \sum_{i=1}^n x_i \log\theta_0 + (n - \sum_{i=1}^n x_i)\log(1 - \theta_0)$$

so that

$$LR = -2\log L(\theta_0) + 2\log L(\hat{\theta}_{MLE}) = -2[\sum_{i=1}^n x_i(\log\theta_0 - \log\bar{x})$$

$$+ (n - \sum_{i=1}^n x_i)(\log(1 - \theta_0) - \log(1 - \bar{x}))]$$

For this example, LR looks different from W and LM. However, a second-order Taylor-Series
expansion of LR around $\theta_0 = \bar{x}$ yields the same statistic. Also, for $n \to \infty$, plim $\bar{x} = \theta$ and if
$H_0$ is true, then all three statistics are asymptotically equivalent. Note also that all three test
statistics are based upon $|\bar{x} - \theta_0| \ge k$ and for finite $n$, the same exact critical value could be
obtained from the binomial distribution. See problem 19 for more examples of the conflict in
test of hypotheses using the W, LR and LM test statistics.
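
To see how the three statistics can differ in a finite sample, the following sketch evaluates them for a hypothetical outcome of 30 successes in $n = 100$ trials with $\theta_0 = 0.2$. The function name and the data are assumptions made for illustration, and the code requires $0 < \bar{x} < 1$ so that the logarithms and the Wald denominator are defined.

```python
from math import log


def bernoulli_tests(successes, n, theta0):
    """Return (W, LM, LR) for H0: theta = theta0; requires 0 < successes < n."""
    xbar = successes / n  # the MLE of theta

    def loglik(theta):
        return successes * log(theta) + (n - successes) * log(1 - theta)

    wald = (xbar - theta0) ** 2 / (xbar * (1 - xbar) / n)    # var(xbar) evaluated at the MLE
    lm = (xbar - theta0) ** 2 / (theta0 * (1 - theta0) / n)  # var(xbar) evaluated at theta0
    lr = 2 * (loglik(xbar) - loglik(theta0))                 # -2 log(lambda)
    return wald, lm, lr


print(bernoulli_tests(successes=30, n=100, theta0=0.2))
```

For these made-up numbers the statistics are roughly 4.76, 6.25 and 5.63, which shows that the finite-sample values need not coincide even though they share the same asymptotic distribution.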

Bera  and  Permaratne  (2001,  p.  58)  tell  the  following  amusing  story  that  can  bring  home 
the  interrelationship  among  the  three  tests:  “Once  around  1946  Ronald  Fisher  invited  Jerzy 
Neyman,  Abraham  Wald,  and  C.R.  Rao  to  his  lodge  for  afternoon  tea.  During  their  conversation, 
Fisher  mentioned  the  problem  of  deciding  whether  his  dog,  who  had  been  going  to  an  “obedience 
school”  for  some  time,  was  disciplined  enough.  Neyman  quickly  came  up  with  an  idea:  leave 
the  dog  free  for  some  time  and  then  put  him  on  his  leash.  If  there  is  not  much  difference  in 
his  behavior,  the  dog  can  be  thought  of  as  having  completed  the  course  successfully.  Wald, 
who  lost  his  family  in  the  concentration  camps,  was  adverse  to  any  restrictions  and  simply 
suggested  leaving  the  dog  free  and  seeing  whether  it  behaved  properly.  Rao,  who  had  observed 
the  nuisances  of  stray  dogs  in  Calcutta  streets  did  not  like  the  idea  of  letting  the  dog  roam 
freely  and  suggested  keeping  the  dog  on  a  leash  at  all  times  and  observing  how  hard  it  pulls 
on  the  leash.  If  it  pulled  too  much,  it  needed  more  training.  That  night  when  Rao  was  back 
in  his  Cambridge  dormitory  after  tending  Fisher’s  mice  at  the  genetics  laboratory,  he  suddenly 
realized  the  connection  of  Neyman  and  Wald’s  recommendations  to  the  Neyman-Pearson  LR 
and  Wald  tests.  He  got  an  idea  and  the  rest  is  history.” 
2.5 Confidence Intervals

Estimation methods considered in section 2.2 give us a point estimate of a parameter, say $\mu$,
and that is the best bet, given the data and the estimation method, of what $\mu$ might be. But
it is always good policy to give the client an interval, rather than a point estimate, where with
some degree of confidence, usually 95% confidence, we expect $\mu$ to lie. We have seen in Figure
2.5 that for a $N(0, 1)$ random variable $z$, we have

$$\Pr[-z_{\alpha/2} \le z \le z_{\alpha/2}] = 1 - \alpha$$

and for $\alpha = 5\%$, this probability is 0.95, giving the required 95% confidence. In fact, $z_{\alpha/2} = 1.96$
and

$$\Pr[-1.96 \le z \le 1.96] = 0.95$$

This says that if we draw 100 random numbers from a $N(0, 1)$ density (using a normal random
number generator), we expect 95 out of these 100 numbers to lie in the $[-1.96, 1.96]$ interval.
Now, let us get back to the problem of estimating $\mu$ from a random sample $x_1, \ldots, x_n$ drawn
from a $N(\mu, \sigma^2)$ distribution. We found out that $\hat{\mu}_{MLE} = \bar{x}$ and $\bar{x} \sim N(\mu, \sigma^2/n)$. Hence,
$z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$ is $N(0, 1)$. The point estimate for $\mu$ is $\bar{x}$ observed from the sample, and the
95% confidence interval for $\mu$ is obtained by replacing $z$ by its value in the above probability
statement:

$$\Pr\left[-z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right] = 1 - \alpha$$

Assuming $\sigma$ is known for the moment, one can rewrite this probability statement after some
simple algebraic manipulations as

$$\Pr[\bar{x} - z_{\alpha/2}(\sigma/\sqrt{n}) \le \mu \le \bar{x} + z_{\alpha/2}(\sigma/\sqrt{n})] = 1 - \alpha$$

Note that this probability statement has random variables on both ends and the probability that
these random variables sandwich the unknown parameter $\mu$ is $1 - \alpha$. With the same confidence
of drawing 100 random $N(0, 1)$ numbers and finding 95 of them falling in the $(-1.96, 1.96)$ range
we are confident that if we drew 100 samples and computed 100 $\bar{x}$'s, and 100 intervals
$(\bar{x} \pm 1.96\,\sigma/\sqrt{n})$, $\mu$ will lie in these intervals in 95 out of 100 times.
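
The repeated-sampling interpretation can be illustrated by simulation. The following is a minimal sketch, assuming a $N(2, 4)$ population with known $\sigma = 2$ and samples of size 25; the function names, the seed and the number of replications are illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def z_interval(sample, sigma, alpha=0.05):
    """Interval xbar -/+ z_{alpha/2} * sigma / sqrt(n) for mu, with sigma known."""
    n = len(sample)
    xbar = fmean(sample)
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


def coverage(mu=2.0, sigma=2.0, n=25, reps=2000, seed=2024):
    """Share of simulated samples whose interval contains the true mu."""
    rng = random.Random(seed)  # assumed seed, only for reproducibility
    covered = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma)
        covered += low <= mu <= high
    return covered / reps


if __name__ == "__main__":
    print(z_interval([1.0, 3.0, 4.0, 6.0], sigma=2.0))
    print(coverage())  # should be close to 0.95
```

It is the interval that is random here, not $\mu$: each simulated sample yields a different interval, and about 95% of them contain the fixed value of $\mu$.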

If $\sigma$ is not known and is replaced by $s$, then problem 12 shows that this is equivalent to
dividing a $N(0,1)$ random variable by the square root of an independent $\chi^2_{n-1}$ random variable
divided by its degrees of freedom, leading to a t-distribution with $(n-1)$ degrees of freedom.
Hence, using the t-tables for $(n-1)$ degrees of freedom

$$\Pr[-t_{\alpha/2;n-1} < t_{n-1} < t_{\alpha/2;n-1}] = 1 - \alpha$$

and replacing $t_{n-1}$ by $(\bar{x} - \mu)/(s/\sqrt{n})$ one gets

$$\Pr[\bar{x} - t_{\alpha/2;n-1}(s/\sqrt{n}) < \mu < \bar{x} + t_{\alpha/2;n-1}(s/\sqrt{n})] = 1 - \alpha$$

Table 2.1 Descriptive Statistics for the Earnings Data

Sample: 1 595

| | LWAGE | WKS | ED | EX | MS | FEM | BLK | UNION |
|---|---|---|---|---|---|---|---|---|
| Mean | 6.9507 | 46.4520 | 12.8450 | 22.8540 | 0.8050 | 0.1126 | 0.0723 | 0.3664 |
| Median | 6.9847 | 48.0000 | 12.0000 | 21.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 |
| Maximum | 8.5370 | 52.0000 | 17.0000 | 51.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Minimum | 5.6768 | 5.0000 | 4.0000 | 7.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Std. Dev. | 0.4384 | 5.1850 | 2.7900 | 10.7900 | 0.3965 | 0.3164 | 0.2592 | 0.4822 |
| Skewness | -0.1140 | -2.7309 | -0.2581 | 0.4208 | -1.5400 | 2.4510 | 3.3038 | 0.5546 |
| Kurtosis | 3.3937 | 13.7780 | 2.7127 | 2.0086 | 3.3715 | 7.0075 | 11.9150 | 1.3076 |
| Jarque-Bera | 5.13 | 3619.40 | 8.65 | 41.93 | 238.59 | 993.90 | 3052.80 | 101.51 |
| Probability | 0.0769 | 0.0000 | 0.0132 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Observations | 595 | 595 | 595 | 595 | 595 | 595 | 595 | 595 |

Note that the degrees of freedom $(n-1)$ for the t-distribution come from $s$, and the corresponding
critical value $t_{\alpha/2;n-1}$ is therefore sample specific, unlike the corresponding case for the normal
density where $z_{\alpha/2}$ does not depend on $n$. For small $n$, the $t_{\alpha/2}$ values differ drastically from
$z_{\alpha/2}$, emphasizing the importance of using the t-density in small samples. When $n$ is large the
difference between $z_{\alpha/2}$ and $t_{\alpha/2}$ diminishes as the t-density becomes more like a normal density.
For $n = 20$ and $\alpha = 0.05$, $t_{\alpha/2;n-1} = 2.093$ as compared with $z_{\alpha/2} = 1.96$. Therefore,

$$\Pr[-2.093 < t_{n-1} < 2.093] = 0.95$$

and $\mu$ lies in $\bar{x} \pm 2.093(s/\sqrt{n})$ with 95% confidence.
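To see what the larger critical value does in practice, the sketch below builds both the t-based and the z-based interval for one sample of size 20. The data are hypothetical and serve only as an illustration; the t critical value 2.093 is the one quoted in the text for $n = 20$ and $\alpha = 0.05$, and the helper name `mean_interval` is an assumption.

```python
from statistics import NormalDist, fmean, stdev

def mean_interval(sample, crit):
    """Interval x_bar +/- crit * s / sqrt(n), where s uses the (n - 1) divisor."""
    n = len(sample)
    x_bar = fmean(sample)
    half_width = crit * stdev(sample) / n ** 0.5
    return x_bar - half_width, x_bar + half_width

# Hypothetical sample of n = 20 observations (illustration only).
sample = [5.1, 4.8, 6.2, 5.5, 4.9, 5.7, 6.0, 5.3, 4.6, 5.9,
          5.2, 5.8, 4.7, 6.1, 5.0, 5.4, 5.6, 4.5, 6.3, 5.5]

T_CRIT = 2.093                        # t_{alpha/2; n-1} for n = 20, alpha = 0.05 (from the text)
Z_CRIT = NormalDist().inv_cdf(0.975)  # z_{alpha/2}, about 1.96

t_lo, t_hi = mean_interval(sample, T_CRIT)
z_lo, z_hi = mean_interval(sample, Z_CRIT)
print(f"t-based interval: ({t_lo:.3f}, {t_hi:.3f})")
print(f"z-based interval: ({z_lo:.3f}, {z_hi:.3f})")
```

The t-based interval is wider than the z-based one by the factor 2.093/1.96, reflecting the extra uncertainty from estimating $\sigma$ by $s$.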

More examples of confidence intervals can be constructed, but the idea should be clear.
Note that these confidence intervals are the other side of the coin for tests of hypotheses. For
example, in testing $H_0$; $\mu = 2$ versus $H_1$; $\mu \neq 2$ for a known $\sigma$, we discovered that the Likelihood
Ratio test is based on the same probability statement that generated the confidence interval
for $\mu$. In classical tests of hypothesis, we choose the level of significance $\alpha = 5\%$ and compute
$z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$. This can be done since $\sigma$ is known and $\mu = 2$ under the null hypothesis
$H_0$. Next, we do not reject $H_0$ if $z$ lies in the $(-z_{\alpha/2}, z_{\alpha/2})$ interval and reject $H_0$ otherwise. For
confidence intervals, on the other hand, we do not know $\mu$, and armed with a level of confidence
$100(1 - \alpha)\%$ we construct the interval that should contain $\mu$ with that level of confidence. Having
done that, if $\mu = 2$ lies in that 95% confidence interval, then we cannot reject $H_0$; $\mu = 2$ at the
5% level. Otherwise, we reject $H_0$. This highlights the fact that any value of $\mu$ that lies in this
95% confidence interval (assuming it was our null hypothesis) cannot be rejected at the 5% level
by this sample. This is why we do not say "accept $H_0$", but rather we say "do not reject $H_0$".
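The equivalence between the two-sided test and the confidence interval can be illustrated numerically. In this sketch the sample mean, $\sigma$, $n$ and the grid of hypothesized values are all hypothetical, and the function names are assumptions introduced for the example.

```python
from statistics import NormalDist

def z_test_rejects(x_bar, mu0, sigma, n, alpha=0.05):
    """Reject H0; mu = mu0 when z falls outside (-z_{alpha/2}, z_{alpha/2})."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z = (x_bar - mu0) / (sigma / n ** 0.5)
    return abs(z) >= z_crit

def interval_contains(x_bar, mu0, sigma, n, alpha=0.05):
    """Check whether mu0 lies inside x_bar +/- z_{alpha/2} * sigma / sqrt(n)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit * sigma / n ** 0.5
    return x_bar - half_width < mu0 < x_bar + half_width

# Hypothetical sample summary (illustration only).
x_bar, sigma, n = 2.5, 1.5, 36

# Null values 1.5, 1.6, ..., 3.5: the test rejects exactly those outside the interval.
for k in range(15, 36):
    mu0 = k / 10
    assert z_test_rejects(x_bar, mu0, sigma, n) == (not interval_contains(x_bar, mu0, sigma, n))

print("rejects mu0 = 2.0:", z_test_rejects(x_bar, 2.0, sigma, n))
print("rejects mu0 = 2.2:", z_test_rejects(x_bar, 2.2, sigma, n))
```

Every hypothesized value inside the interval is a null hypothesis this sample cannot reject, which is why the wording "do not reject" is preferred to "accept".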


2.6  Descriptive  Statistics 

In  Chapter  4,  we  will  consider  the  estimation  of  a  simple  wage  equation  based  on  595  individuals 
drawn  from  the  Panel  Study  of  Income  Dynamics  for  1982.  This  data  is  available  on  the  Springer 
web  site  as  EARN.ASC.  Table  2.1  gives  the  descriptive  statistics  using  EViews  for  a  subset  of 
the  variables  in  this  data  set. 

Figure 2.8 Log (Wage) Histogram. (Histogram not reproduced. Series LWAGE, sample 1 595, 595 observations: mean 6.950745, median 6.984720, maximum 8.537000, minimum 5.676750, std. dev. 0.438403, skewness -0.114001, kurtosis 3.393651, Jarque-Bera 5.130525, probability 0.076899.)

Figure 2.9 Weeks Worked Histogram. (Histogram not reproduced. Series WKS, sample 1 595, 595 observations: mean 46.45210, median 48.00000, maximum 52.00000, minimum 5.000000, std. dev. 5.185025, skewness -2.730880, kurtosis 13.77787, Jarque-Bera 3619.416, probability 0.000000.)


The average log wage is 6.95 for this sample with a minimum of 5.68 and a maximum of
8.54. The standard deviation of log wage is 0.44. A plot of the log wage histogram is given
in Figure 2.8. Weeks worked vary between 5 and 52 with an average of 46.5 and a standard
deviation of 5.2. This variable is highly skewed as evidenced by the histogram in Figure 2.9.
Years of education vary between 4 and 17 with an average of 12.8 and a standard deviation
of 2.79. There is the usual bunching up at 12 years, which is also the median, as is clear from
Figure 2.10.

Experience varies between 7 and 51 with an average of 22.9 and a standard deviation of 10.79.
The distribution of this variable is skewed to the right (its skewness coefficient in Table 2.1 is
positive, 0.42, and its mean exceeds its median), as shown in Figure 2.11.
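The skewness, kurtosis and Jarque-Bera entries reported with these descriptive statistics can be tied together in a few lines. The sketch below assumes the usual definitions (moment-based skewness and kurtosis with the $n$ divisor, $JB = \frac{n}{6}[S^2 + (K-3)^2/4]$, and a $\chi^2_2$ reference distribution); the text does not state the formula behind the reported numbers, so this is an assumption, although it reproduces the LWAGE values of about 5.13 and 0.0769. The function names are illustrative.

```python
import math

def skewness_kurtosis(x):
    """Moment-based skewness and kurtosis, using the n divisor throughout."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

def jarque_bera(n, skew, kurt):
    """JB statistic and its p-value under a chi-squared(2) reference distribution."""
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    p_value = math.exp(-jb / 2)  # survival function of a chi-squared with 2 d.f.
    return jb, p_value

# Skewness and kurtosis of LWAGE as reported with the log wage histogram.
jb, p_value = jarque_bera(595, -0.114001, 3.393651)
print(f"JB = {jb:.4f}, p-value = {p_value:.4f}")
```

A normal density has skewness 0 and kurtosis 3, so JB grows with any departure from those two values; WKS, with skewness near -2.7 and kurtosis near 13.8, is far from normal by this measure.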

Figure 2.10 Years of Education Histogram. (Histogram not reproduced. Series ED, sample 1 595, 595 observations: mean 12.84538, median 12.00000, maximum 17.00000, minimum 4.000000, std. dev. 2.790006, skewness -0.258116, kurtosis 2.712730, Jarque-Bera 8.652780, probability 0.013215.)

Figure 2.11 (experience histogram). (Histogram not reproduced. Series EX, sample 1 595, 595 observations: mean 22.85378, median 21.00000, maximum 51.00000, minimum 7.000000, std. dev. 10.79018, skewness 0.420826, kurtosis 2.008578, Jarque-Bera 41.93007, probability 0.000000.)

Marital status is a qualitative variable indicating whether the individual is married or not.
This information is recoded as a numeric (1,0) variable, one if the individual is married and zero
otherwise. This recoded variable is also known as a dummy variable. It is basically a switch
turning on when the individual is married and off when he or she is not. Female is another
dummy variable taking the value one when the individual is a female and zero otherwise. Black
is a dummy variable taking the value one when the individual is black and zero otherwise. Union
is a dummy variable taking the value one if the individual's wage is set by a union contract and
zero otherwise. The minimum and maximum values for these dummy variables are obvious. But
if they were not zero and one, respectively, you would know that something is wrong. The average is a
meaningful statistic indicating the percentage of married individuals, females, blacks and union
contracted wages in the sample. From the means in Table 2.1, these are 80.5, 11.3, 7.2 and 36.6%,
respectively. We would
like to investigate the following claims: (i) women are paid less than men; (ii) blacks are paid
less than non-blacks; (iii) married individuals earn more than non-married individuals; and (iv)
union contracted wages are higher than non-union wages.

Table 2.2 Test for the Difference in Means

| | Average log wage | Difference (t-statistic in parentheses) |
|---|---|---|
| Male | 7.004 | -0.474 |
| Female | 6.530 | (-8.86) |
| Non-Black | 6.978 | -0.377 |
| Black | 6.601 | (-5.57) |
| Not Married | 6.664 | 0.356 |
| Married | 7.020 | (8.28) |
| Non-Union | 6.945 | 0.017 |
| Union | 6.962 | (0.45) |

Table 2.3 Correlation Matrix

| | LWAGE | WKS | ED | EX | MS | FEM | BLK | UNION |
|---|---|---|---|---|---|---|---|---|
| LWAGE | 1.0000 | 0.0403 | 0.4566 | 0.0873 | 0.3218 | -0.3419 | -0.2229 | 0.0183 |
| WKS | 0.0403 | 1.0000 | 0.0002 | -0.1061 | 0.0782 | -0.0875 | -0.0594 | -0.1721 |
| ED | 0.4566 | 0.0002 | 1.0000 | -0.2219 | 0.0184 | -0.0012 | -0.1196 | -0.2719 |
| EX | 0.0873 | -0.1061 | -0.2219 | 1.0000 | 0.1570 | -0.0938 | 0.0411 | 0.0689 |
| MS | 0.3218 | 0.0782 | 0.0184 | 0.1570 | 1.0000 | -0.7104 | -0.2231 | 0.1189 |
| FEM | -0.3419 | -0.0875 | -0.0012 | -0.0938 | -0.7104 | 1.0000 | 0.2086 | -0.1274 |
| BLK | -0.2229 | -0.0594 | -0.1196 | 0.0411 | -0.2231 | 0.2086 | 1.0000 | 0.0302 |
| UNION | 0.0183 | -0.1721 | -0.2719 | 0.0689 | 0.1189 | -0.1274 | 0.0302 | 1.0000 |

A simple first check could be based on computing the average log wage for each of these categories
and testing whether the difference in means is significantly different from zero. This
can be done using a t-test, see Table 2.2. The average log wage for males and females is given
along with their difference and the corresponding t-statistic for the significance of this difference.
Other rows of Table 2.2 give similar statistics for other groupings. In Chapter 4, we will
show that this t-test can be obtained from a simple regression of log wage on the categorical
dummy variable distinguishing the two groups, in this case the Female dummy variable. From
Table 2.2, it is clear that only the difference between union and non-union contracted wages is
insignificant.
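A difference-in-means t-statistic of the kind reported in Table 2.2 can be sketched as follows. This sketch assumes the pooled (equal variance) form of the statistic, which is the version that coincides with the t-statistic on a dummy variable in a simple regression; the text does not state which form was used for the table, and the two small groups of log wages are hypothetical.

```python
from statistics import fmean, variance

def diff_in_means_t(group0, group1):
    """Return (difference, t-statistic) for mean(group1) - mean(group0), pooled variance."""
    n0, n1 = len(group0), len(group1)
    diff = fmean(group1) - fmean(group0)
    pooled_var = ((n0 - 1) * variance(group0) + (n1 - 1) * variance(group1)) / (n0 + n1 - 2)
    std_err = (pooled_var * (1 / n0 + 1 / n1)) ** 0.5
    return diff, diff / std_err

# Hypothetical log wages (illustration only, not the EARN.ASC data).
male = [7.1, 6.9, 7.3, 7.0, 6.8, 7.2]
female = [6.5, 6.7, 6.4, 6.6, 6.3]

diff, t_stat = diff_in_means_t(male, female)
print(f"difference = {diff:.3f}, t-statistic = {t_stat:.2f}")
```

The statistic is compared with a t critical value with $n_0 + n_1 - 2$ degrees of freedom; with 595 observations that critical value is very close to 1.96.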

One  can  also  plot  log  wage  versus  experience,  see  Figure  2.12,  log  wage  versus  education,  see 
Figure  2.13,  and  log  wage  versus  weeks,  see  Figure  2.14. 

The data show that, in general, log wage increases with education level and weeks worked, but
that it exhibits a rising and then a declining pattern with more years of experience. Note that
the t-tests based on the difference in log wage across two groupings of individuals, by sex, race or
marital status, or the figures plotting log wage versus education, log wage versus weeks worked
are based on pairs of variables in each case. A nice summary statistic based also on pairwise comparisons
of these variables is the correlation matrix across the data. This is given in Table 2.3.

Figure 2.12 Log (Wage) Versus Experience (plot not reproduced)

Figure 2.13 Log (Wage) Versus Education (plot not reproduced)

Figure 2.14 Log (Wage) Versus Weeks (plot not reproduced)

The signs of this correlation matrix give the direction of linear relationship between the
corresponding two variables, while the magnitude gives the strength of this correlation. In
Chapter 3, we will see that these simple correlations when squared give the percentage of
variation that one of these variables explains in the other. For example, the simple correlation
coefficient between log wage and marital status is 0.32. This means that marital status explains
$(0.32)^2$ or 10% of the variation in log wage.
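The link between a squared simple correlation and the share of variation explained can be verified on any pair of variables. The sketch below uses hypothetical data and illustrative function names; it computes the correlation coefficient directly and compares its square with the $R^2$ of a least squares line fitted to the same pair.

```python
def pearson_r(x, y):
    """Simple (Pearson) correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def regression_r2(x, y):
    """Share of the variation in y explained by a least squares line in x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    intercept = y_bar - slope * x_bar
    ss_res = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - y_bar) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: a married dummy and log wages (illustration only).
married = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
log_wage = [7.1, 6.9, 6.4, 7.3, 6.8, 6.7, 7.2, 6.5, 7.0, 6.6]

r = pearson_r(married, log_wage)
print(r ** 2, regression_r2(married, log_wage))  # equal up to rounding error
print(round(0.3218 ** 2, 4))  # squared LWAGE-MS correlation from Table 2.3
```

Using the four-decimal correlation from Table 2.3 gives 0.1036, which is the "10%" quoted in the text.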

One  cannot  emphasize  enough  how  important  it  is  to  check  one’s  data.  It  is  important  to 
compute  the  descriptive  statistics,  simple  plots  of  the  data  and  simple  correlations.  A  wrong 
minimum  or  maximum  could  indicate  some  possible  data  entry  errors.  Troughs  or  peaks  in  these 
plots  may  indicate  important  events  for  time  series  data,  like  wars  or  recessions,  or  influential 
observations.  More  on  this  in  Chapter  8.  Simple  correlation  coefficients  that  equal  one  indicate 
perfectly  collinear  variables  and  warn  of  the  failure  of  a  linear  regression  that  has  both  variables 
included  among  the  regressors,  see  Chapter  4. 
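The checks described in this paragraph are easy to automate before any regression is run. The following is a minimal sketch with hypothetical variables and illustrative function names; it also flags correlations of minus one as perfectly collinear, which goes slightly beyond the wording of the text.

```python
def describe(values):
    """Minimum, maximum and mean, the first things to inspect for entry errors."""
    return {"min": min(values), "max": max(values), "mean": sum(values) / len(values)}

def is_dummy(values):
    """A dummy variable should take only the values zero and one."""
    return set(values) <= {0, 1}

def correlation(x, y):
    """Simple correlation coefficient; raises if either variable is constant."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / (sxx * syy) ** 0.5

def perfectly_collinear(x, y, tol=1e-9):
    """True when the correlation is one in absolute value, up to a tolerance."""
    return abs(abs(correlation(x, y)) - 1.0) < tol

# Hypothetical variables (illustration only).
union = [0, 1, 1, 0, 2]            # the 2 is a deliberate data entry error
weeks = [48, 52, 40, 50, 45]
weeks_rescaled = [w / 52 for w in weeks]

print(describe(union), is_dummy(union))
print(perfectly_collinear(weeks, weeks_rescaled))
```

Here the maximum of 2 exposes the miscoded dummy, and the rescaled copy of `weeks` is flagged as perfectly collinear with the original.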

Notes

1. Actually $E(s^2) = \sigma^2$ does not need the normality assumption. This fact along with the proof of
$(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$, under Normality, can be easily shown using matrix algebra and is deferred
to Chapter 7.

2. This can be proven using Chebyshev's inequality, see Hogg and Craig (1995).

3.  See  Hogg  and  Craig  (1995)  for  the  type  of  regularity  conditions  needed  for  these  distributions. 

Problems 

1. Variance and Covariance of Linear Combinations of Random Variables. Let $a, b, c, d, e$ and $f$ be
arbitrary constants and let $X$ and $Y$ be two random variables.

(a) Show that $\mathrm{var}(a + bX) = b^2\,\mathrm{var}(X)$.

(b) $\mathrm{var}(a + bX + cY) = b^2\,\mathrm{var}(X) + c^2\,\mathrm{var}(Y) + 2bc\,\mathrm{cov}(X, Y)$.

(c) $\mathrm{cov}[(a + bX + cY), (d + eX + fY)] = be\,\mathrm{var}(X) + cf\,\mathrm{var}(Y) + (bf + ce)\,\mathrm{cov}(X, Y)$.

2.  Independence  and  Simple  Correlation. 

(a) Show that if $X$ and $Y$ are independent, then $E(XY) = E(X)E(Y) = \mu_X\mu_Y$ where $\mu_X = E(X)$
and $\mu_Y = E(Y)$. Therefore, $\mathrm{cov}(X, Y) = E(X - \mu_X)(Y - \mu_Y) = 0$.

(b) Show that if $Y = a + bX$, where $a$ and $b$ are arbitrary constants, then $\rho_{XY} = 1$ if $b > 0$ and
$-1$ if $b < 0$.

3. Zero Covariance Does Not Necessarily Imply Independence. Let $X = -2, -1, 0, 1, 2$ with
$\Pr[X = x] = 1/5$. Assume a perfect quadratic relationship between $Y$ and $X$, namely $Y = X^2$. Show that
$\mathrm{cov}(X, Y) = E(X^3) = 0$. Deduce that $\rho_{XY} = \mathrm{correlation}(X, Y) = 0$. The simple correlation coefficient
$\rho_{XY}$ measures the strength of the linear relationship between $X$ and $Y$. For this example,
it is zero even though there is a perfect nonlinear relationship between $X$ and $Y$. This is also an
example of the fact that if $\rho_{XY} = 0$, then $X$ and $Y$ are not necessarily independent. $\rho_{XY} = 0$ is a
necessary but not sufficient condition for $X$ and $Y$ to be independent. The converse, however, is
true, i.e., if $X$ and $Y$ are independent, then $\rho_{XY} = 0$, see problem 2.

4. The Binomial Distribution is defined as the number of successes in $n$ independent Bernoulli trials
with probability of success $\theta$. This discrete probability function is given by

$$f(X; \theta) = \binom{n}{X}\theta^X(1 - \theta)^{n-X} \qquad X = 0, 1, \ldots, n$$

and zero elsewhere, with $\binom{n}{X} = n!/[X!(n - X)!]$.

(a) Out of 20 candidates for a job, each with a probability of being hired of 0.1, compute the probabilities
of getting $X = 5$ or 6 people hired.

(b) Show that $\binom{n}{X} = \binom{n}{n-X}$ and use that to conclude that $b(n, X, \theta) = b(n, n - X, 1 - \theta)$.

(c) Verify that $E(X) = n\theta$ and $\mathrm{var}(X) = n\theta(1 - \theta)$.

(d) For a random sample of size $n$ drawn from the Bernoulli distribution with parameter $\theta$, show
that $\bar{X}$ is the MLE of $\theta$.

(e) Show that $\bar{X}$, in part (d), is unbiased and consistent for $\theta$.

(f) Show that $\bar{X}$, in part (d), is sufficient for $\theta$.

(g) Derive the Cramér-Rao lower bound for any unbiased estimator of $\theta$. Is $\bar{X}$, in part (d), MVU
for $\theta$?

(h) For $n = 20$, derive the uniformly most powerful critical region of size $\alpha \leq 0.05$ for testing
$H_0$; $\theta = 0.2$ versus $H_1$; $\theta = 0.6$. What is the probability of type II error for this test criterion?

(i) Form the Likelihood Ratio test for testing $H_0$; $\theta = 0.2$ versus $H_1$; $\theta \neq 0.2$. Derive the Wald
and LM test statistics for testing $H_0$ versus $H_1$. When is the Wald statistic greater than the
LM statistic?

5. For a random sample of size $n$ drawn from the Normal distribution with mean $\mu$ and variance $\sigma^2$:

(a) Show that $s^2$ is a sufficient statistic for $\sigma^2$.

(b) Using the fact that $(n - 1)s^2/\sigma^2$ is $\chi^2_{n-1}$ (without proof), verify that $E(s^2) = \sigma^2$ and that
$\mathrm{var}(s^2) = 2\sigma^4/(n - 1)$ as shown in the text.

(c) Given that $\sigma^2$ is unknown, form the Likelihood Ratio test statistic for testing $H_0$; $\mu = 2$
versus $H_1$; $\mu \neq 2$. Derive the Wald and Lagrange Multiplier statistics for testing $H_0$ versus
$H_1$. Verify that they are given by the expressions in example 4.

(d) Another derivation of the $W \geq LR \geq LM$ inequality for the null hypothesis given in part (c)
can be obtained as follows: Let $\tilde{\mu}, \tilde{\sigma}^2$ be the restricted maximum likelihood estimators under
$H_0$; $\mu = \mu_0$. Let $\hat{\mu}, \hat{\sigma}^2$ be the corresponding unrestricted maximum likelihood estimators
under the alternative $H_1$; $\mu \neq \mu_0$. Show that $W = -2\log[L(\tilde{\mu}, \hat{\sigma}^2)/L(\hat{\mu}, \hat{\sigma}^2)]$ and
$LM = -2\log[L(\tilde{\mu}, \tilde{\sigma}^2)/L(\hat{\mu}, \tilde{\sigma}^2)]$ where $L(\mu, \sigma^2)$ denotes the likelihood function. Conclude that
$W \geq LR \geq LM$, see Breusch (1979). This is based on Baltagi (1994).

(e) Given that $\mu$ is unknown, form the Likelihood Ratio test statistic for testing $H_0$; $\sigma = 3$ versus
$H_1$; $\sigma \neq 3$.

(f) Form the Likelihood Ratio test statistic for testing $H_0$; $\mu = 2$, $\sigma^2 = 4$ against the alternative
that $H_1$; $\mu \neq 2$ or $\sigma^2 \neq 4$.

(g) For $n = 20$, $s^2 = 9$ construct a 95% confidence interval for $\sigma^2$.

6. The Poisson distribution can be defined as the limit of a Binomial distribution as $n \to \infty$ and
$\theta \to 0$ such that $n\theta = \lambda$ is a positive constant. For example, this could be the probability of a
rare disease and we are random sampling a large number of inhabitants, or it could be the rare
probability of finding oil and $n$ is the large number of drilling sites. This discrete probability
function is given by

$$f(X; \lambda) = \frac{e^{-\lambda}\lambda^X}{X!} \qquad X = 0, 1, 2, \ldots$$

For a random sample from this Poisson distribution

(a) Show that $E(X) = \lambda$ and $\mathrm{var}(X) = \lambda$.

(b) Show that the MLE of $\lambda$ is $\hat{\lambda}_{mle} = \bar{X}$.

(c) Show that the method of moments estimator of $\lambda$ is also $\bar{X}$.

(d) Show that $\bar{X}$ is unbiased and consistent for $\lambda$.

(e) Show that $\bar{X}$ is sufficient for $\lambda$.

(f) Derive the Cramér-Rao lower bound for any unbiased estimator of $\lambda$. Show that $\bar{X}$ attains
that bound.

(g) For $n = 9$, derive the Uniformly Most Powerful critical region of size $\alpha \leq 0.05$ for testing
$H_0$; $\lambda = 2$ versus $H_1$; $\lambda = 4$.

(h) Form the Likelihood Ratio test for testing $H_0$; $\lambda = 2$ versus $H_1$; $\lambda \neq 2$. Derive the Wald and
LM statistics for testing $H_0$ versus $H_1$. When is the Wald test statistic greater than the LM
statistic?

7. The Geometric distribution is known as the probability of waiting for the first success in independent
repeated trials of a Bernoulli process. This could occur on the 1st, 2nd, 3rd, ... trials.

$$g(X; \theta) = \theta(1 - \theta)^{X-1} \qquad \text{for } X = 1, 2, 3, \ldots$$

(a) Show that $E(X) = 1/\theta$ and $\mathrm{var}(X) = (1 - \theta)/\theta^2$.

(b) Given a random sample from this Geometric distribution of size $n$, find the MLE of $\theta$ and
the method of moments estimator of $\theta$.

(c) Show that $\bar{X}$ is unbiased and consistent for $1/\theta$.

(d) For $n = 20$, derive the Uniformly Most Powerful critical region of size $\alpha \leq 0.05$ for testing
$H_0$; $\theta = 0.5$ versus $H_1$; $\theta = 0.3$.

(e) Form the Likelihood Ratio test for testing $H_0$; $\theta = 0.5$ versus $H_1$; $\theta \neq 0.5$. Derive the Wald
and LM statistics for testing $H_0$ versus $H_1$. When is the Wald statistic greater than the LM
statistic?

8. The Uniform density, defined over the unit interval [0,1], assigns a unit density to all values
of $X$ in that interval. It is like a roulette wheel that has an equal chance of stopping anywhere
between 0 and 1.

$$f(X) = \begin{cases} 1 & 0 \leq X \leq 1 \\ 0 & \text{elsewhere} \end{cases}$$

Computers are equipped with a Uniform (0,1) random number generator so it is important to
understand these distributions.

(a) Show that $E(X) = 1/2$ and $\mathrm{var}(X) = 1/12$.

(b) What is the $\Pr[0.1 < X < 0.3]$? Does it matter if we ask for the $\Pr[0.1 \leq X \leq 0.3]$?

9. The Exponential distribution is given by

$$f(X; \theta) = \frac{1}{\theta}e^{-X/\theta} \qquad X > 0 \text{ and } \theta > 0$$

This is a skewed and continuous distribution defined only over the positive quadrant.

(a) Show that $E(X) = \theta$ and $\mathrm{var}(X) = \theta^2$.

(b) Show that $\hat{\theta}_{mle} = \bar{X}$.

(c) Show that the method of moments estimator of $\theta$ is also $\bar{X}$.

(d) Show that $\bar{X}$ is an unbiased and consistent estimator of $\theta$.

(e) Show that $\bar{X}$ is sufficient for $\theta$.

(f) Derive the Cramér-Rao lower bound for any unbiased estimator of $\theta$. Is $\bar{X}$ MVU for $\theta$?

(g) For $n = 20$, derive the Uniformly Most Powerful critical region of size $\alpha \leq 0.05$ for testing
$H_0$; $\theta = 1$ versus $H_1$; $\theta = 2$.

(h) Form the Likelihood Ratio test for testing $H_0$; $\theta = 1$ versus $H_1$; $\theta \neq 1$. Derive the Wald and
LM statistics for testing $H_0$ versus $H_1$. When is the Wald statistic greater than the LM
statistic?

10. The Gamma distribution is given by

$$f(X; \alpha, \beta) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\beta^{\alpha}}X^{\alpha-1}e^{-X/\beta} & \text{for } X > 0 \\ 0 & \text{elsewhere} \end{cases}$$

where $\alpha$ and $\beta > 0$ and $\Gamma(\alpha) = (\alpha - 1)!$ This is a skewed and continuous distribution.

(a) Show that $E(X) = \alpha\beta$ and $\mathrm{var}(X) = \alpha\beta^2$.

(b) For a random sample drawn from this Gamma density, what are the method of moments
estimators of $\alpha$ and $\beta$?

(c) Verify that for $\alpha = 1$ and $\beta = \theta$, the Gamma probability density function reverts to the
Exponential p.d.f. considered in problem 9.

(d) We state without proof that for $\alpha = r/2$ and $\beta = 2$, this Gamma density reduces to a $\chi^2$
distribution with $r$ degrees of freedom, denoted by $\chi^2_r$. Show that $E(\chi^2_r) = r$ and $\mathrm{var}(\chi^2_r) = 2r$.

(e) For a random sample from the $\chi^2_r$ distribution, show that $(X_1X_2\cdots X_n)$ is a sufficient
statistic for $r$.

(f) One can show that the square of a $N(0,1)$ random variable is a $\chi^2$ random variable with
1 degree of freedom, see the Appendix to the chapter. Also, one can show that the sum
of independent $\chi^2$'s is a $\chi^2$ random variable with degrees of freedom equal to the sum of the
corresponding degrees of freedom of the individual $\chi^2$'s, see problem 15. This will prove useful
for testing later on. Using these results, verify that the sum of squares of $m$ independent
$N(0,1)$ random variables is a $\chi^2$ with $m$ degrees of freedom.

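11. The Beta distribution is defined by

$$f(X) = \begin{cases} \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}X^{\alpha-1}(1 - X)^{\beta-1} & \text{for } 0 < X < 1 \\ 0 & \text{elsewhere} \end{cases}$$

where $\alpha > 0$ and $\beta > 0$. This is a skewed continuous distribution.

(a) For $\alpha = \beta = 1$ this reverts back to the Uniform (0,1) probability density function. Show
that $E(X) = \alpha/(\alpha + \beta)$ and $\mathrm{var}(X) = \alpha\beta/[(\alpha + \beta)^2(\alpha + \beta + 1)]$.

(b) Suppose that $\alpha = 1$, find the estimators of $\beta$ using the method of moments and the method
of maximum likelihood.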

12. The t-distribution with $r$ degrees of freedom can be defined as the ratio of two independent random
variables. The numerator being a $N(0,1)$ random variable and the denominator being the square-root
of a $\chi^2_r$ random variable divided by its degrees of freedom. The t-distribution is a symmetric
distribution like the Normal distribution but with fatter tails. As $r \to \infty$, the t-distribution
approaches the Normal distribution.

(a) Verify that if $X_1, \ldots, X_n$ are a random sample drawn from a $N(\mu, \sigma^2)$ distribution, then
$z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ is $N(0,1)$.

(b) Use the fact that $(n - 1)s^2/\sigma^2 \sim \chi^2_{n-1}$ to show that $t = z/\sqrt{s^2/\sigma^2} = (\bar{X} - \mu)/(s/\sqrt{n})$ has a
t-distribution with $(n - 1)$ degrees of freedom. We use the fact that $s^2$ is independent of $\bar{X}$
without proving it.

(c) For $n = 16$, $\bar{x} = 20$ and $s^2 = 4$, construct a 95% confidence interval for $\mu$.

13. The F-distribution can be defined as the ratio of two independent $\chi^2$ random variables each divided
by its corresponding degrees of freedom. It is commonly used to test the equality of variances. Let
$s_1^2$ be the sample variance from a random sample of size $n_1$ drawn from $N(\mu_1, \sigma_1^2)$ and let $s_2^2$ be
the sample variance from another random sample of size $n_2$ drawn from $N(\mu_2, \sigma_2^2)$. We know that
$(n_1 - 1)s_1^2/\sigma_1^2$ is $\chi^2_{(n_1-1)}$ and $(n_2 - 1)s_2^2/\sigma_2^2$ is $\chi^2_{(n_2-1)}$. Taking the ratio of those two independent
$\chi^2$ random variables divided by their appropriate degrees of freedom yields

$$F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}$$

which under the null hypothesis $\sigma_1^2 = \sigma_2^2$ gives $F = s_1^2/s_2^2$ and is distributed as $F$ with $(n_1 - 1)$
and $(n_2 - 1)$ degrees of freedom. Both $s_1^2$ and $s_2^2$ are observable, so $F$ can be computed and
compared to critical values for the F-distribution with the appropriate degrees of freedom. Two
inspectors drawing two random samples of size 25 and 31 from two shifts of a factory producing
steel rods find that the sample variances of the lengths of these rods are 15.6 and 18.9 inches
squared. Test whether the variances of the two shifts are the same.

14. Moment Generating Function (MGF).

(a) Derive the MGF of the Binomial distribution defined in problem 4. Show that it is equal to
$[(1 - \theta) + \theta e^t]^n$.

(b) Derive the MGF of the Normal distribution defined in problem 5. Show that it is $e^{\mu t + \frac{1}{2}\sigma^2t^2}$.

(c) Derive the MGF of the Poisson distribution defined in problem 6. Show that it is $e^{\lambda(e^t - 1)}$.

(d) Derive the MGF of the Geometric distribution defined in problem 7. Show that it is
$\theta e^t/[1 - (1 - \theta)e^t]$.

(e) Derive the MGF of the Exponential distribution defined in problem 9. Show that it is
$1/(1 - \theta t)$.

(f) Derive the MGF of the Gamma distribution defined in problem 10. Show that it is $(1 - \beta t)^{-\alpha}$.
Conclude that the MGF of a $\chi^2_r$ is $(1 - 2t)^{-r/2}$.

(g) Obtain the mean and variance of each distribution by differentiating the corresponding MGF
derived in parts (a) through (f).

15. Moment Generating Function Method.

(a) Show that if $X_1, \ldots, X_n$ are independent Poisson distributed with parameters $(\lambda_i)$ respectively,
then $Y = \sum_{i=1}^n X_i$ is Poisson with parameter $\sum_{i=1}^n \lambda_i$.

(b) Show that if $X_1, \ldots, X_n$ are independent Normally distributed with parameters $(\mu_i, \sigma_i^2)$, then
$Y = \sum_{i=1}^n X_i$ is Normal with mean $\sum_{i=1}^n \mu_i$ and variance $\sum_{i=1}^n \sigma_i^2$.

(c) Deduce from part (b) that if $X_1, \ldots, X_n$ are IIN$(\mu, \sigma^2)$, then $\bar{X} \sim N(\mu, \sigma^2/n)$.

(d) Show that if $X_1, \ldots, X_n$ are independent $\chi^2$ distributed with parameters $(r_i)$ respectively,
then $Y = \sum_{i=1}^n X_i$ is $\chi^2$ distributed with parameter $\sum_{i=1}^n r_i$.

16. Best Linear Prediction. (Problems 16 and 17 are based on Amemiya (1994)). Let $X$ and $Y$ be two
random variables with means $\mu_X$ and $\mu_Y$ and variances $\sigma_X^2$ and $\sigma_Y^2$, respectively. Suppose that

$$\rho = \mathrm{correlation}(X, Y) = \sigma_{XY}/\sigma_X\sigma_Y$$

where $\sigma_{XY} = \mathrm{cov}(X, Y)$. Consider the linear relationship $Y = \alpha + \beta X$ where $\alpha$ and $\beta$ are scalars:

(a) Show that the best linear predictor of $Y$ based on $X$, where best in this case means the
minimum mean squared error predictor which minimizes $E(Y - \alpha - \beta X)^2$ with respect to $\alpha$
and $\beta$, is given by $\hat{Y} = \hat{\alpha} + \hat{\beta}X$ where $\hat{\alpha} = \mu_Y - \hat{\beta}\mu_X$ and $\hat{\beta} = \sigma_{XY}/\sigma_X^2 = \rho\sigma_Y/\sigma_X$.

(b) Show that $\mathrm{var}(\hat{Y}) = \rho^2\sigma_Y^2$ and that $\hat{u} = Y - \hat{Y}$, the prediction error, has mean zero and
variance equal to $(1 - \rho^2)\sigma_Y^2$. Therefore, $\rho^2$ can be interpreted as the proportion of $\sigma_Y^2$ that
is explained by the best linear predictor $\hat{Y}$.

(c) Show that $\mathrm{cov}(\hat{Y}, \hat{u}) = 0$.

17. The Best Predictor. Let $X$ and $Y$ be the two random variables considered in problem 16. Now
consider predicting $Y$ by a general, possibly non-linear, function of $X$ denoted by $h(X)$.

(a) Show that the best predictor of $Y$ based on $X$, where best in this case means the minimum
mean squared error predictor that minimizes $E[Y - h(X)]^2$, is given by $h(X) = E(Y/X)$.
Hint: Write $E[Y - h(X)]^2$ as $E\{[Y - E(Y/X)] + [E(Y/X) - h(X)]\}^2$. Expand the square
and show that the cross-product term has zero expectation. Conclude that this mean squared
error is minimized at $h(X) = E(Y/X)$.

(b) If $X$ and $Y$ are bivariate Normal, show that the best predictor of $Y$ based on $X$ is identical
to the best linear predictor of $Y$ based on $X$.

18. Descriptive Statistics. Using the data used in section 2.6 based on 595 individuals drawn from the
Panel Study of Income Dynamics for 1982 and available on the Springer web site as EARN.ASC,
replicate the tables and graphs given in that section. More specifically

(a) Replicate Table 2.1 which gives the descriptive statistics for a subset of the variables in this
data set.

(b) Replicate Figures 2.8-2.11 which plot the histograms for log wage, weeks worked, education
and experience.

(c) Replicate Table 2.2 which gives the average log wage for various groups and test the difference
between these averages using a t-test.

(d) Replicate Figure 2.12 which plots log wage versus experience, Figure 2.13 which plots log
wage versus education and Figure 2.14 which plots log wage versus weeks worked.

(e) Replicate Table 2.3 which gives the correlation matrix among a subset of these variables.

19. Conflict Among Criteria for Testing Hypotheses: Examples from Non-normal Distributions. This is based on Baltagi (2000). Berndt and Savin (1977) showed that $W \geq LR \geq LM$ for the case of a multivariate regression model with normal disturbances. Ullah and Zinde-Walsh (1984) showed that this inequality is not robust to non-normality of the disturbances. In the spirit of the latter article, this problem considers simple examples from non-normal distributions and illustrates how this conflict among criteria is affected.

(a) Consider a random sample $x_1, x_2, \ldots, x_n$ from a Poisson distribution with parameter $\lambda$. Show that testing $\lambda = 3$ versus $\lambda \neq 3$ yields $W \geq LM$ for $\bar{x} \leq 3$ and $W \leq LM$ for $\bar{x} \geq 3$.

(b) Consider a random sample $x_1, x_2, \ldots, x_n$ from an Exponential distribution with parameter $\theta$. Show that testing $\theta = 3$ versus $\theta \neq 3$ yields $W \geq LM$ for $0 < \bar{x} \leq 3$ and $W \leq LM$ for $\bar{x} \geq 3$.

(c) Consider a random sample $x_1, x_2, \ldots, x_n$ from a Bernoulli distribution with parameter $\theta$. Show that for testing $\theta = 0.5$ versus $\theta \neq 0.5$, we will always get $W \geq LM$. Show also that for testing $\theta = (2/3)$ versus $\theta \neq (2/3)$ we get $W \leq LM$ for $(1/3) \leq \bar{x} \leq (2/3)$ and $W \geq LM$ for $(2/3) \leq \bar{x} \leq 1$ or $0 < \bar{x} \leq (1/3)$.
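Before attempting the algebra in part (a), it may help to see the ranking numerically. The sketch below is not a proof. It assumes the usual single-parameter forms $W = (\hat{\lambda} - \lambda_0)^2 I(\hat{\lambda})$ and $LM = S(\lambda_0)^2 / I(\lambda_0)$, with $\hat{\lambda} = \bar{x}$, $S(\lambda) = n(\bar{x} - \lambda)/\lambda$ and $I(\lambda) = n/\lambda$ for the Poisson case. The function name and the values of $\bar{x}$ and $n$ are illustrative choices only.

```python
def wald_and_lm_poisson(xbar, n, lam0):
    """Wald and LM statistics for H0: lambda = lam0 in an IID Poisson sample.

    Assumes the MLE is xbar, the score is n*(xbar - lam)/lam and the
    information is n/lam (standard Poisson results, stated here as assumptions).
    """
    info_at_mle = n / xbar                      # information evaluated at the MLE
    info_at_null = n / lam0                     # information evaluated under H0
    score_at_null = n * (xbar - lam0) / lam0    # score evaluated under H0
    wald = (xbar - lam0) ** 2 * info_at_mle
    lm = score_at_null ** 2 / info_at_null
    return wald, lm


# Hypothetical sample means on both sides of the null value 3
for xbar in (1.5, 2.5, 3.0, 3.5, 6.0):
    w, lm = wald_and_lm_poisson(xbar, n=50, lam0=3.0)
    print(f"xbar={xbar}: W={w:.3f}, LM={lm:.3f}")
```

The same pattern can be adapted to parts (b) and (c) by replacing the score and information with those of the Exponential and Bernoulli likelihoods.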
Appendix

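Score and Information Matrix: The likelihood function of a sample $X_1, \ldots, X_n$ drawn from $f(X_i, \theta)$ is really the joint probability density function written as a function of $\theta$:

$$L(\theta) = f(X_1, \ldots, X_n; \theta)$$

This probability density function has the property that $\int L(\theta)\,dx = 1$ where the integral is over all $X_1, \ldots, X_n$ written compactly as one integral over $x$. Differentiating this multiple integral with respect to $\theta$, one gets

$$\int \frac{\partial L}{\partial \theta}\,dx = 0$$

Multiplying and dividing by $L$, one gets

$$\int \left(\frac{1}{L}\frac{\partial L}{\partial \theta}\right) L\,dx = \int \left(\frac{\partial \log L}{\partial \theta}\right) L\,dx = 0$$

But the score is by definition $S(\theta) = \partial \log L/\partial \theta$. Hence $E[S(\theta)] = 0$. Differentiating again with respect to $\theta$, one gets

$$\int \left[\left(\frac{\partial^2 \log L}{\partial \theta^2}\right) L + \left(\frac{\partial \log L}{\partial \theta}\right)\left(\frac{\partial L}{\partial \theta}\right)\right] dx = 0$$

Multiplying and dividing the second term by $L$ one gets

$$E\left[\frac{\partial^2 \log L}{\partial \theta^2} + \left(\frac{\partial \log L}{\partial \theta}\right)^2\right] = 0$$

or

$$E\left[-\frac{\partial^2 \log L}{\partial \theta^2}\right] = E\left[\left(\frac{\partial \log L}{\partial \theta}\right)^2\right] = E[S(\theta)^2]$$

But $\text{var}[S(\theta)] = E[S(\theta)^2]$ since $E[S(\theta)] = 0$. Hence $I(\theta) = \text{var}[S(\theta)]$.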
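The two results $E[S(\theta)] = 0$ and $I(\theta) = \text{var}[S(\theta)]$ can be checked exactly for a single Bernoulli draw, where the expectation is a two-term sum. The sketch below uses exact fractions; the value $\theta = 3/10$ and the function names are illustrative assumptions.

```python
from fractions import Fraction


def score(x, theta):
    """d/dtheta of the Bernoulli log-likelihood x*log(theta) + (1-x)*log(1-theta)."""
    return Fraction(x) / theta - Fraction(1 - x) / (1 - theta)


def second_derivative(x, theta):
    """Second derivative of the same log-likelihood with respect to theta."""
    return -Fraction(x) / theta**2 - Fraction(1 - x) / (1 - theta) ** 2


theta = Fraction(3, 10)                 # hypothetical parameter value
probs = {0: 1 - theta, 1: theta}        # Pr[X = x]

mean_score = sum(p * score(x, theta) for x, p in probs.items())
var_score = sum(p * score(x, theta) ** 2 for x, p in probs.items())
information = -sum(p * second_derivative(x, theta) for x, p in probs.items())

print("E[S] =", mean_score)
print("var[S] =", var_score)
print("-E[d2 log L / dtheta2] =", information)
```

Both the variance of the score and the negative expected second derivative come out as $1/[\theta(1-\theta)]$ in this case.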


Moment Generating Function (MGF): For the random variable $X$, the expected value of a special function of $X$, namely $e^{Xt}$, is denoted by

$$M_X(t) = E(e^{Xt}) = E\left(1 + Xt + X^2\frac{t^2}{2!} + X^3\frac{t^3}{3!} + ..\right)$$

where the second equality follows from the Taylor series expansion of $e^{Xt}$ around zero. Therefore,

$$M_X(t) = 1 + E(X)t + E(X^2)\frac{t^2}{2!} + E(X^3)\frac{t^3}{3!} + ..$$

This function of $t$ generates the moments of $X$ as coefficients of an infinite polynomial in $t$. For example, $\mu = E(X)$ is the coefficient of $t$, and $E(X^2)/2$ is the coefficient of $t^2$, etc. Alternatively, one can differentiate this MGF with respect to $t$ and obtain $\mu = E(X) = M'_X(0)$, i.e., the first derivative of $M_X(t)$ with respect to $t$ evaluated at $t = 0$. Similarly, $E(X^r) = M_X^{(r)}(0)$ which is the $r$-th derivative of $M_X(t)$ with respect to $t$ evaluated at $t = 0$. For example, for the Bernoulli distribution:

$$M_X(t) = E(e^{Xt}) = \sum_{x=0}^{1} e^{xt}\theta^x(1-\theta)^{1-x} = \theta e^t + (1-\theta)$$

so that $M'_X(t) = \theta e^t$ and $M'_X(0) = \theta = E(X)$, and $M''_X(t) = \theta e^t$ which means that $E(X^2) = M''_X(0) = \theta$. Hence,

$$\text{var}(X) = E(X^2) - (E(X))^2 = \theta - \theta^2 = \theta(1-\theta).$$

For the Normal distribution, see problem 14, it is easy to show that if $X \sim N(\mu, \sigma^2)$, then $M_X(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}$ and $M'_X(0) = E(X) = \mu$ and $M''_X(0) = E(X^2) = \sigma^2 + \mu^2$.

There is a one-to-one correspondence between the MGF when it exists and the corresponding p.d.f. This means that if $Y$ has a MGF given by $e^{2t+4t^2}$ then $Y$ is normally distributed with mean 2 and variance 8. Similarly, if $Z$ has a MGF given by $(e^t + 1)/2$, then $Z$ is Bernoulli distributed with $\theta = 1/2$.
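The moment-generating property can be illustrated numerically by approximating $M'_X(0)$ and $M''_X(0)$ with central finite differences. This is only a sketch: the step size `h`, the value $\theta = 0.3$ and the helper names are assumptions made for the illustration, and the derivatives are approximate rather than exact.

```python
import math


def first_derivative_at_zero(mgf, h=1e-3):
    """Central-difference approximation to M'(0)."""
    return (mgf(h) - mgf(-h)) / (2.0 * h)


def second_derivative_at_zero(mgf, h=1e-3):
    """Central-difference approximation to M''(0)."""
    return (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / h**2


def moments_from_mgf(mgf):
    """Return (mean, variance) implied by the first two derivatives at zero."""
    mean = first_derivative_at_zero(mgf)
    second_moment = second_derivative_at_zero(mgf)
    return mean, second_moment - mean**2


theta = 0.3  # hypothetical Bernoulli parameter
bernoulli_mean, bernoulli_var = moments_from_mgf(
    lambda t: theta * math.exp(t) + (1.0 - theta)
)
# MGF exp(2t + 4t^2) from the text: Normal with mean 2 and variance 8
normal_mean, normal_var = moments_from_mgf(lambda t: math.exp(2.0 * t + 4.0 * t * t))

print(bernoulli_mean, bernoulli_var)   # close to theta and theta*(1 - theta)
print(normal_mean, normal_var)         # close to 2 and 8
```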
Change of Variable: If $X \sim N(0,1)$, then one can find the distribution function of $Y = |X|$ by using the Distribution Function method. The distribution function of $Y$ is defined as

$$G(y) = \Pr[Y \leq y] = \Pr[|X| \leq y] = \Pr[-y \leq X \leq y] = \Pr[X \leq y] - \Pr[X \leq -y] = F(y) - F(-y)$$

so that the distribution function of $Y$, $G(y)$, can be obtained from the distribution function of $X$, $F(x)$. Since the $N(0,1)$ distribution is symmetric around zero, $F(-y) = 1 - F(y)$, and substituting that in $G(y)$ we get $G(y) = 2F(y) - 1$. Recall that the p.d.f. of $Y$ is given by $g(y) = G'(y)$. Hence, $g(y) = f(y) + f(-y)$ and this reduces to $2f(y)$ if the distribution is symmetric around zero. So that if $f(x) = e^{-x^2/2}/\sqrt{2\pi}$ for $-\infty < x < +\infty$ then $g(y) = 2f(y) = 2e^{-y^2/2}/\sqrt{2\pi}$ for $y \geq 0$.

Let us now find the distribution of $Z = X^2$, the square of a $N(0,1)$ random variable. Note that $dZ/dX = 2X$ which is positive when $X > 0$ and negative when $X < 0$. The change of variable method cannot be applied since $Z = X^2$ is not a monotonic transformation over the entire domain of $X$. However, using $Y = |X|$, we get $Z = Y^2 = (|X|)^2$ and $dZ/dY = 2Y$ which is always non-negative since $Y$ is non-negative. In this case, the change of variable method states that the p.d.f. of $Z$ is obtained from that of $Y$ by substituting the inverse transformation $Y = \sqrt{Z}$ into the p.d.f. of $Y$ and multiplying it by the absolute value of the derivative of the inverse transformation:

$$h(z) = g(\sqrt{z}) \left|\frac{d\sqrt{z}}{dz}\right| = \frac{2}{\sqrt{2\pi}} e^{-z/2} \cdot \frac{1}{2\sqrt{z}} = \frac{1}{\sqrt{2\pi}} z^{-1/2} e^{-z/2} \quad \text{for } z \geq 0$$

It is clear why this transformation will not work for $X$ since $Z = X^2$ has two solutions for the inverse transformation, $X = \pm\sqrt{Z}$, whereas there is one unique solution for $Y = \sqrt{Z}$ since it is non-negative. Using the results of problem 10, one can deduce that $Z$ has a gamma distribution with $\alpha = 1/2$ and $\beta = 2$. This special Gamma density function is a $\chi^2$ distribution with 1 degree of freedom. Hence, we have shown that the square of a $N(0,1)$ random variable has a $\chi^2_1$ distribution.
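A quick simulation can make this result concrete. Since $\Pr[Z \leq z] = \Pr[|X| \leq \sqrt{z}] = 2F(\sqrt{z}) - 1$, the distribution function of $Z$ can be written with the error function as $\text{erf}(\sqrt{z/2})$. The sketch below compares that expression with the empirical distribution of squared standard normal draws; the seed, sample size and evaluation points are arbitrary choices.

```python
import math
import random

random.seed(12345)                      # arbitrary seed, for reproducibility
draws = [random.gauss(0.0, 1.0) ** 2 for _ in range(200_000)]


def chi2_1_cdf(z):
    """Pr[Z <= z] = 2F(sqrt(z)) - 1 for Z = X^2, X ~ N(0,1), via the error function."""
    return math.erf(math.sqrt(z / 2.0))


comparison = {}
for z in (0.5, 1.0, 2.0, 4.0):
    empirical = sum(d <= z for d in draws) / len(draws)
    comparison[z] = (empirical, chi2_1_cdf(z))
    print(f"z={z}: empirical={empirical:.4f}, theoretical={chi2_1_cdf(z):.4f}")

sample_mean = sum(draws) / len(draws)   # a chi-squared(1) variable has mean 1
print("sample mean of Z:", round(sample_mean, 3))
```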

Finally, if $X_1, \ldots, X_n$ are independently distributed then the distribution function of $Y = \sum_{i=1}^n X_i$ can be obtained from that of the $X_i$'s using the Moment Generating Function (MGF) method:

$$M_Y(t) = E(e^{Yt}) = E[e^{(\sum_{i=1}^n X_i)t}] = E(e^{X_1 t})E(e^{X_2 t})..E(e^{X_n t}) = M_{X_1}(t)M_{X_2}(t)..M_{X_n}(t)$$

If in addition these $X_i$'s are identically distributed, then $M_{X_i}(t) = M_X(t)$ for $i = 1, \ldots, n$ and

$$M_Y(t) = [M_X(t)]^n$$

For example, if $X_1, \ldots, X_n$ are IID Bernoulli($\theta$), then $M_{X_i}(t) = M_X(t) = \theta e^t + (1-\theta)$ for $i = 1, \ldots, n$. Hence the MGF of $Y = \sum_{i=1}^n X_i$ is given by

$$M_Y(t) = [M_X(t)]^n = [\theta e^t + (1-\theta)]^n$$

This can be easily shown to be the MGF of the Binomial distribution given in problem 14. This proves that the sum of $n$ independent and identically distributed Bernoulli random variables with parameter $\theta$ is a Binomial random variable with the same parameter $\theta$.
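One way to see this without the MGF algebra is to note that $[\theta e^t + (1-\theta)]^n$ is a polynomial in $e^t$ whose coefficient on $e^{kt}$ is $\Pr[Y = k]$. Multiplying the polynomial out is the same as convolving the Bernoulli probability function with itself $n$ times. The sketch below does this with exact fractions and compares the result with the Binomial probabilities; $n = 6$ and $\theta = 1/4$ are illustrative values.

```python
from fractions import Fraction
from math import comb


def convolve(p, q):
    """Probability function of the sum of two independent integer-valued variables."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def sum_of_bernoullis_pmf(n, theta):
    """Pr[Y = k], k = 0..n, for Y the sum of n IID Bernoulli(theta) variables."""
    pmf = [Fraction(1)]                 # the empty sum equals 0 with probability 1
    bernoulli = [1 - theta, theta]      # Pr[X = 0], Pr[X = 1]
    for _ in range(n):
        pmf = convolve(pmf, bernoulli)
    return pmf


def binomial_pmf(n, theta):
    """Binomial(n, theta) probabilities from the usual closed form."""
    return [comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)]


theta, n = Fraction(1, 4), 6            # hypothetical values
from_convolution = sum_of_bernoullis_pmf(n, theta)
from_formula = binomial_pmf(n, theta)
print(from_convolution == from_formula)
```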


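Central Limit Theorem: If $X_1, \ldots, X_n$ are IID($\mu, \sigma^2$) from an unknown distribution, then

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

is asymptotically distributed as $N(0,1)$.

Proof: We assume that the MGF of the $X_i$'s exists and derive the MGF of $Z$. Next, we show that $\lim M_Z(t)$ as $n \to \infty$ is $e^{t^2/2}$ which is the MGF of the $N(0,1)$ distribution. First, note that

$$Z = \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} = \frac{Y - n\mu}{\sigma\sqrt{n}}$$

where $Y = \sum_{i=1}^n X_i$ with $M_Y(t) = [M_X(t)]^n$. Therefore,

$$M_Z(t) = E(e^{Zt}) = E\left(e^{(Yt - n\mu t)/\sigma\sqrt{n}}\right) = e^{-n\mu t/\sigma\sqrt{n}} E\left(e^{Yt/\sigma\sqrt{n}}\right) = e^{-n\mu t/\sigma\sqrt{n}} M_Y(t/\sigma\sqrt{n}) = e^{-n\mu t/\sigma\sqrt{n}} [M_X(t/\sigma\sqrt{n})]^n$$

Taking log of both sides we get

$$\log M_Z(t) = -\frac{n\mu t}{\sigma\sqrt{n}} + n \log\left[1 + \frac{t}{\sigma\sqrt{n}}E(X) + \frac{t^2}{2\sigma^2 n}E(X^2) + \frac{t^3}{6\sigma^3 n\sqrt{n}}E(X^3) + ..\right]$$

Using the Taylor series expansion $\log(1+s) = s - \frac{s^2}{2} + \frac{s^3}{3} - ..$ we get

$$\log M_Z(t) = -\frac{\sqrt{n}\mu}{\sigma}t + n\left\{\left[\frac{\mu t}{\sigma\sqrt{n}} + \frac{t^2}{2\sigma^2 n}E(X^2) + \frac{t^3}{6\sigma^3 n\sqrt{n}}E(X^3) + ..\right] - \frac{1}{2}\left[\frac{\mu t}{\sigma\sqrt{n}} + \frac{t^2}{2\sigma^2 n}E(X^2) + \frac{t^3}{6\sigma^3 n\sqrt{n}}E(X^3) + ..\right]^2 + \frac{1}{3}\left[\frac{\mu t}{\sigma\sqrt{n}} + \frac{t^2}{2\sigma^2 n}E(X^2) + \frac{t^3}{6\sigma^3 n\sqrt{n}}E(X^3) + ..\right]^3 - ..\right\}$$

Collecting powers of $t$, we get

$$\log M_Z(t) = \left(-\frac{\sqrt{n}\mu}{\sigma} + \frac{\sqrt{n}\mu}{\sigma}\right)t + \left(\frac{E(X^2)}{2\sigma^2} - \frac{\mu^2}{2\sigma^2}\right)t^2 + \left(\frac{E(X^3)}{6\sigma^3\sqrt{n}} - \frac{1}{2}\cdot\frac{2\mu E(X^2)}{2\sigma^3\sqrt{n}} + \frac{1}{3}\cdot\frac{\mu^3}{\sigma^3\sqrt{n}}\right)t^3 + ..$$

Therefore, using $E(X^2) - \mu^2 = \sigma^2$,

$$\log M_Z(t) = \frac{1}{2}t^2 + \left(\frac{E(X^3)}{6} - \frac{\mu E(X^2)}{2} + \frac{\mu^3}{3}\right)\frac{t^3}{\sigma^3\sqrt{n}} + ..$$

Note that the coefficient of $t^3$ is $1/\sqrt{n}$ times a constant. Therefore, this coefficient goes to zero as $n \to \infty$. Similarly, it can be shown that the coefficient of $t^r$ is $1/\sqrt{n^{r-2}}$ times a constant for $r \geq 3$. Hence,

$$\lim_{n \to \infty} \log M_Z(t) = \frac{1}{2}t^2 \quad \text{and} \quad \lim_{n \to \infty} M_Z(t) = e^{\frac{1}{2}t^2}$$

which is the MGF of a standard normal distribution.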

The Central Limit Theorem is a powerful tool for asymptotic inference. In real life we do not know what distribution we are sampling from, but as long as the sample drawn is random and we average (or sum) and standardize, then as $n \to \infty$ the resulting standardized statistic has an asymptotic $N(0,1)$ distribution that can be used for inference.

Using a random number generator from say the uniform distribution on the computer, one can generate samples of size $n = 20, 30, 50$ from this distribution and show how the sampling distribution of the sum (or average) when it is standardized closely approximates the $N(0,1)$ distribution.
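The experiment just described can be sketched as follows. The Uniform(0,1) distribution has mean $1/2$ and variance $1/12$, so the standardized average is $Z = (\bar{X} - 1/2)/(\sqrt{1/12}/\sqrt{n})$. As one simple summary of closeness to $N(0,1)$, the code reports the share of replications with $|Z| \leq 1.96$, which should be near 0.95. The number of replications, the seed and the choice of this particular summary are assumptions of the illustration.

```python
import math
import random


def standardized_mean(n, rng):
    """Draw n Uniform(0,1) values and return Z = (xbar - mu) / (sigma / sqrt(n))."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    xbar = sum(rng.random() for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))


def coverage(n, replications=20_000, seed=0):
    """Share of replications whose standardized mean falls inside [-1.96, 1.96]."""
    rng = random.Random(seed)
    zs = [standardized_mean(n, rng) for _ in range(replications)]
    return sum(abs(z) <= 1.96 for z in zs) / replications


results = {n: coverage(n) for n in (20, 30, 50)}
for n, share in results.items():
    print(f"n={n}: share of |Z| <= 1.96 is {share:.4f}")
```

A histogram of the simulated $Z$ values against the standard normal density would show the same thing graphically.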

The real question for the applied researcher is how large $n$ should be to invoke the Central Limit Theorem. This depends on the distribution we are drawing from. For a Bernoulli distribution, a larger $n$ is needed the more asymmetric this distribution is, i.e., if $\theta = 0.1$ rather than 0.5.

In fact, Figure 2.15 shows the Poisson distribution with mean = 15. This looks like a good approximation for a Normal distribution even though it is a discrete probability function. Problem 15 shows that the sum of $n$ independent identically distributed Poisson random variables with parameter $\lambda$ is a Poisson random variable with parameter $(n\lambda)$. This means that if $\lambda = 0.15$, an $n$ of 100 will lead to the distribution of the sum being Poisson $(n\lambda = 15)$ and the Central Limit Theorem seems well approximated.
Figure 2.15 Poisson Probability Distribution, Mean = 15

Figure 2.16 Poisson Probability Distribution, Mean = 1.5


However, if $\lambda = 0.015$, an $n$ of 100 will lead to the distribution of the sum being Poisson $(n\lambda = 1.5)$ which is given in Figure 2.16. This Poisson probability function is skewed and discrete and does not approximate well a normal density. This shows that one has to be careful in concluding that $n = 100$ is a large enough sample for the Central Limit Theorem to apply. We showed in this simple example that this depends on the distribution we are sampling from. This is true for Poisson $(\lambda = 0.15)$ but not Poisson $(\lambda = 0.015)$, see Joliffe (1995). The same idea can be illustrated with a skewed Bernoulli distribution.
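The contrast between the two Poisson cases can be quantified without plotting. The sketch below computes the largest gap between the Poisson distribution function and that of a Normal with the same mean and variance, evaluated at $k + 0.5$ (a continuity correction, which is an assumption of this illustration and not something used in the text). The truncation point `kmax` is also an arbitrary choice.

```python
import math


def normal_cdf(x, mean, sd):
    """Distribution function of N(mean, sd^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def max_cdf_gap(lam, kmax=100):
    """Largest |Poisson CDF - Normal CDF| over k = 0..kmax, Normal evaluated at k + 0.5."""
    pmf = math.exp(-lam)                # Pr[X = 0]
    cdf, gap = 0.0, 0.0
    for k in range(kmax + 1):
        cdf += pmf
        gap = max(gap, abs(cdf - normal_cdf(k + 0.5, lam, math.sqrt(lam))))
        pmf *= lam / (k + 1)            # Pr[X = k+1] obtained from Pr[X = k]
    return gap


gap_mean_15 = max_cdf_gap(15.0)         # sum of 100 Poisson(0.15) draws
gap_mean_1_5 = max_cdf_gap(1.5)         # sum of 100 Poisson(0.015) draws
print(f"mean 15:  max gap = {gap_mean_15:.4f}, skewness = {1 / math.sqrt(15.0):.3f}")
print(f"mean 1.5: max gap = {gap_mean_1_5:.4f}, skewness = {1 / math.sqrt(1.5):.3f}")
```

The skewness of a Poisson variable is $1/\sqrt{\lambda}$, which is why the case with mean 1.5 is visibly less symmetric.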


Conditional Mean and Variance: Two random variables $X$ and $Y$ are bivariate Normal if they have the following joint distribution:

$$f(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right]\right\}$$

where $-\infty < x < +\infty$, $-\infty < y < +\infty$, $E(X) = \mu_X$, $E(Y) = \mu_Y$, $\text{var}(X) = \sigma_X^2$, $\text{var}(Y) = \sigma_Y^2$ and $\rho = \text{correlation}(X,Y) = \text{cov}(X,Y)/\sigma_X\sigma_Y$. This joint density can be rewritten as

$$f(x,y) = \frac{1}{\sqrt{2\pi}\sigma_Y\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2\sigma_Y^2(1-\rho^2)}\left[y - \mu_Y - \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)\right]^2\right\} \cdot \frac{1}{\sqrt{2\pi}\sigma_X} \exp\left\{-\frac{1}{2\sigma_X^2}(x - \mu_X)^2\right\} = f(y/x)f_1(x)$$

where $f_1(x)$ is the marginal density of $X$ and $f(y/x)$ is the conditional density of $Y$ given $X$. In this case, $X \sim N(\mu_X, \sigma_X^2)$ and $Y/X$ is Normal with mean $E(Y/X) = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ and variance given by $\text{var}(Y/X) = \sigma_Y^2(1-\rho^2)$.

By symmetry, the roles of $X$ and $Y$ can be interchanged and one can write $f(x,y) = f(x/y)f_2(y)$ where $f_2(y)$ is the marginal density of $Y$. In this case, $Y \sim N(\mu_Y, \sigma_Y^2)$ and $X/Y$ is Normal with mean $E(X/Y) = \mu_X + \rho\frac{\sigma_X}{\sigma_Y}(y - \mu_Y)$ and variance given by $\text{var}(X/Y) = \sigma_X^2(1-\rho^2)$. If $\rho = 0$, then $f(y/x) = f_2(y)$ and $f(x,y) = f_1(x)f_2(y)$ proving that $X$ and $Y$ are independent. Therefore, if $\text{cov}(X,Y) = 0$ and $X$ and $Y$ are bivariate Normal, then $X$ and $Y$ are independent. In general, $\text{cov}(X,Y) = 0$ alone does not necessarily imply independence, see problem 3.

One  important  and  useful  property  is  the  law  of  iterated  expectations.  This  says  that  the  expectation 
of  any  function  of  X  and  Y  say  h(X,  Y )  can  be  obtained  as  follows: 


$$E[h(X,Y)] = E_X E_{Y/X}[h(X,Y)]$$

where the subscript $Y/X$ on $E$ means the conditional expectation of $Y$ given that $X$ is treated as a constant. The next expectation $E_X$ treats $X$ as a random variable. The proof is simple.

$$E[h(X,Y)] = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} h(x,y)f(x,y)\,dx\,dy$$

where $f(x,y)$ is the joint density of $X$ and $Y$. But $f(x,y)$ can be written as $f(y/x)f_1(x)$, hence

$$E[h(X,Y)] = \int_{-\infty}^{+\infty}\left[\int_{-\infty}^{+\infty} h(x,y)f(y/x)\,dy\right] f_1(x)\,dx = E_X E_{Y/X}[h(X,Y)].$$

Example: This law of iterated expectation can be used to show that for the bivariate Normal density, the parameter $\rho$ is indeed the correlation coefficient of $X$ and $Y$. In fact, let $h(X,Y) = XY$, then

$$E(XY) = E_X E_{Y/X}(XY/X) = E_X X E(Y/X) = E_X X\left[\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(X - \mu_X)\right] = \mu_X\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}\sigma_X^2 = \mu_X\mu_Y + \rho\sigma_Y\sigma_X$$

Rearranging terms, one gets $\rho = [E(XY) - \mu_X\mu_Y]/\sigma_X\sigma_Y = \sigma_{XY}/\sigma_X\sigma_Y$ as required.
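The factorization $f(x,y) = f(y/x)f_1(x)$ also suggests a way to simulate bivariate Normal pairs: draw $X$ from its marginal, then draw $Y$ from the conditional Normal with mean $E(Y/X)$ and variance $\text{var}(Y/X)$. The sketch below does this and recovers $\rho$ from the simulated $E(XY)$ using the rearrangement in the example. All parameter values, the seed and the sample size are hypothetical.

```python
import math
import random


def draw_bivariate_normal(mu_x, mu_y, sd_x, sd_y, rho, rng):
    """Draw (X, Y) by sampling X from f1(x) and then Y from f(y/x)."""
    x = rng.gauss(mu_x, sd_x)
    cond_mean = mu_y + rho * (sd_y / sd_x) * (x - mu_x)    # E(Y/X)
    cond_sd = sd_y * math.sqrt(1.0 - rho**2)               # square root of var(Y/X)
    y = rng.gauss(cond_mean, cond_sd)
    return x, y


rng = random.Random(2024)                                   # arbitrary seed
mu_x, mu_y, sd_x, sd_y, rho = 1.0, -2.0, 2.0, 3.0, 0.6      # hypothetical parameters
pairs = [draw_bivariate_normal(mu_x, mu_y, sd_x, sd_y, rho, rng) for _ in range(100_000)]

mean_xy = sum(x * y for x, y in pairs) / len(pairs)
implied_rho = (mean_xy - mu_x * mu_y) / (sd_x * sd_y)       # rho = [E(XY) - mu_X mu_Y] / (sigma_X sigma_Y)
print("simulated E(XY):", round(mean_xy, 3))
print("implied rho:", round(implied_rho, 3))
```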

Another  useful  result  pertains  to  the  unconditional  variance  of  h(X,  Y )  being  the  sum  of  the  mean  of 
the  conditional  variance  and  the  variance  of  the  conditional  mean: 


$$\text{var}(h(X,Y)) = E_X \text{var}_{Y/X}[h(X,Y)] + \text{var}_X E_{Y/X}[h(X,Y)]$$

Proof: We will write $h(X,Y)$ as $h$ to simplify the presentation.

$$\text{var}_{Y/X}(h) = E_{Y/X}(h^2) - [E_{Y/X}(h)]^2$$

and taking expectations with respect to $X$ yields

$$E_X \text{var}_{Y/X}(h) = E_X E_{Y/X}(h^2) - E_X[E_{Y/X}(h)]^2 = E(h^2) - E_X[E_{Y/X}(h)]^2.$$

Also,

$$\text{var}_X E_{Y/X}(h) = E_X[E_{Y/X}(h)]^2 - (E_X[E_{Y/X}(h)])^2 = E_X[E_{Y/X}(h)]^2 - [E(h)]^2.$$

Adding these two terms yields

$$E(h^2) - [E(h)]^2 = \text{var}(h).$$
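Both the law of iterated expectations and this variance decomposition hold for discrete distributions as well, with sums replacing integrals, which allows an exact check. The sketch below uses a small made-up joint probability function and an arbitrary function $h(X,Y) = X + Y^2$; neither comes from the text.

```python
from fractions import Fraction as F

# Hypothetical joint probability function Pr[X = x, Y = y]
joint = {
    (0, 0): F(1, 8), (0, 1): F(3, 8),
    (1, 0): F(1, 4), (1, 2): F(1, 4),
}


def h(x, y):
    """An arbitrary function of X and Y used for the illustration."""
    return x + y * y


# Marginal probability function of X
marginal_x = {}
for (x, _), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0) + p


def cond_moment(x, power):
    """E[h^power / X = x], computed from the joint probabilities."""
    total = sum(p * h(xx, y) ** power for (xx, y), p in joint.items() if xx == x)
    return total / marginal_x[x]


total_mean = sum(p * h(x, y) for (x, y), p in joint.items())
total_var = sum(p * h(x, y) ** 2 for (x, y), p in joint.items()) - total_mean**2

iterated_mean = sum(px * cond_moment(x, 1) for x, px in marginal_x.items())
mean_of_cond_var = sum(
    px * (cond_moment(x, 2) - cond_moment(x, 1) ** 2) for x, px in marginal_x.items()
)
var_of_cond_mean = (
    sum(px * cond_moment(x, 1) ** 2 for x, px in marginal_x.items()) - total_mean**2
)

print(total_mean == iterated_mean)
print(total_var == mean_of_cond_var + var_of_cond_mean)
```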


CHAPTER  3 

Simple  Linear  Regression 

3.1  Introduction 

In this chapter, we study extensively the estimation of a linear relationship between two variables, $Y_i$ and $X_i$, of the form:

$$Y_i = \alpha + \beta X_i + u_i \qquad i = 1, 2, \ldots, n \qquad (3.1)$$

where $Y_i$ denotes the $i$-th observation on the dependent variable $Y$ which could be consumption, investment or output, and $X_i$ denotes the $i$-th observation on the independent variable $X$ which could be disposable income, the interest rate or an input. These observations could be collected on firms or households at a given point in time, in which case we call the data a cross-section. Alternatively, these observations may be collected over time for a specific industry or country in which case we call the data a time-series. $n$ is the number of observations, which could be the number of firms or households in a cross-section, or the number of years if the observations are collected annually. $\alpha$ and $\beta$ are the intercept and slope of this simple linear relationship between $Y$ and $X$. They are assumed to be unknown parameters to be estimated from the data. A plot of the data, i.e., $Y$ versus $X$ would be very illustrative showing what type of relationship exists empirically between these two variables. For example, if $Y$ is consumption and $X$ is disposable income then we would expect a positive relationship between these variables and the data may look like Figure 3.1 when plotted for a random sample of households. If $\alpha$ and $\beta$ were known, one could draw the straight line $(\alpha + \beta X)$ as shown in Figure 3.1. It is clear that not all the observations $(X_i, Y_i)$ lie on the straight line $(\alpha + \beta X)$. In fact, equation (3.1) states that the difference between each $Y_i$ and the corresponding $(\alpha + \beta X_i)$ is due to a random error $u_i$. This error may be due to (i) the omission of relevant factors that could influence consumption, other than disposable income, like real wealth or varying tastes, or unforeseen events that induce households to consume more or less, (ii) measurement error, which could be the result of households not reporting their consumption or income accurately, or (iii) wrong choice of a linear relationship between consumption and income, when the true relationship may be nonlinear. These different causes of the error term will have different effects on the distribution of this error. In what follows, we consider only disturbances that satisfy some restrictive assumptions. In later chapters we relax these assumptions to account for more general kinds of error terms.

In real life, $\alpha$ and $\beta$ are not known, and have to be estimated from the observed data $\{(X_i, Y_i)$ for $i = 1, 2, \ldots, n\}$. This also means that the true line $(\alpha + \beta X)$ as well as the true disturbances (the $u_i$'s) are unobservable. In this case, $\alpha$ and $\beta$ could be estimated by the best fitting line through the data. Different researchers may draw different lines through the same data. What makes one line better than another? One measure of misfit is the amount of error from the observed $Y_i$ to the guessed line, let us call the latter $\hat{Y}_i = \hat{\alpha} + \hat{\beta}X_i$, where the hat (^) denotes a guess on the appropriate parameter or variable. Each observation $(X_i, Y_i)$ will have a corresponding observable error attached to it, which we will call $e_i = Y_i - \hat{Y}_i$, see Figure 3.2. In other words, we obtain the guessed $Y_i$, $(\hat{Y}_i)$, corresponding to each $X_i$ from the guessed line, $\hat{\alpha} + \hat{\beta}X_i$. Next, we find our error in guessing that $Y_i$ by subtracting the guessed $\hat{Y}_i$ from the actual $Y_i$. The only difference between Figure 3.1 and Figure 3.2 is the fact that Figure 3.1 draws the true consumption line which is unknown to the researcher, whereas Figure 3.2 is a guessed consumption line drawn through the data. Therefore, while the $u_i$'s are unobservable, the $e_i$'s are observable. Note that there will be $n$ errors for each line, one error corresponding to every observation.

Similarly, there will be another set of $n$ errors for another guessed line drawn through the data. For each guessed line, we can summarize its corresponding errors by one number, the sum of squares of these errors, which seems to be a natural criterion for penalizing a wrong guess. Note that a simple sum of these errors is not a good choice for a measure of misfit since positive errors end up canceling negative errors when both should be counted in our measure. However, this does not mean that the sum of squared errors is the only single measure of misfit. Other measures include the sum of absolute errors, but this latter measure is mathematically more difficult to handle. Once the measure of misfit is chosen, $\alpha$ and $\beta$ could then be estimated by minimizing this measure. In fact, this is the idea behind least squares estimation.
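A small numerical comparison shows why the simple sum of errors is a poor criterion. In the sketch below, a flat line through the mean of $Y$ has errors that sum to exactly zero, yet it fits far worse than a sloped line when judged by the sum of absolute or squared errors. The five data points and both guessed lines are made up for the illustration.

```python
def misfit(alpha_guess, beta_guess, xs, ys):
    """Three summary measures of the errors e_i = Y_i - (alpha_guess + beta_guess * X_i)."""
    errors = [y - (alpha_guess + beta_guess * x) for x, y in zip(xs, ys)]
    return {
        "sum": sum(errors),
        "sum_abs": sum(abs(e) for e in errors),
        "sum_sq": sum(e * e for e in errors),
    }


xs = [10, 20, 30, 40, 50]               # hypothetical incomes
ys = [14, 22, 25, 33, 36]               # hypothetical consumption (mean is 26)

flat_line = misfit(26.0, 0.0, xs, ys)   # horizontal line through the mean of Y
sloped_line = misfit(9.0, 0.55, xs, ys) # a guessed upward-sloping line

print("flat line:  ", flat_line)
print("sloped line:", sloped_line)
```

The flat line wins on the simple sum of errors because its positive and negative errors cancel, while the other two measures rank the sloped line as the better guess.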


3.2  Least  Squares  Estimation  and  the  Classical  Assumptions 

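Least squares minimizes the residual sum of squares where the residuals are given by

$$e_i = Y_i - \hat{\alpha} - \hat{\beta}X_i \qquad i = 1, 2, \ldots, n$$

and $\hat{\alpha}$ and $\hat{\beta}$ denote guesses on the regression parameters $\alpha$ and $\beta$, respectively. The residual sum of squares denoted by $RSS = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \hat{\alpha} - \hat{\beta}X_i)^2$ is minimized by the two first-order conditions:

$$\partial(\textstyle\sum_{i=1}^n e_i^2)/\partial\hat{\alpha} = -2\sum_{i=1}^n e_i = 0; \quad \text{or} \quad \sum_{i=1}^n Y_i - n\hat{\alpha} - \hat{\beta}\sum_{i=1}^n X_i = 0 \qquad (3.2)$$

$$\partial(\textstyle\sum_{i=1}^n e_i^2)/\partial\hat{\beta} = -2\sum_{i=1}^n e_i X_i = 0; \quad \text{or} \quad \sum_{i=1}^n Y_i X_i - \hat{\alpha}\sum_{i=1}^n X_i - \hat{\beta}\sum_{i=1}^n X_i^2 = 0 \qquad (3.3)$$

Solving the least squares normal equations given in (3.2) and (3.3) for $\hat{\alpha}$ and $\hat{\beta}$ one gets

$$\hat{\alpha}_{OLS} = \bar{Y} - \hat{\beta}_{OLS}\bar{X} \quad \text{and} \quad \hat{\beta}_{OLS} = \sum_{i=1}^n x_i y_i \Big/ \sum_{i=1}^n x_i^2 \qquad (3.4)$$

where $\bar{Y} = \sum_{i=1}^n Y_i/n$, $\bar{X} = \sum_{i=1}^n X_i/n$, $y_i = Y_i - \bar{Y}$, $x_i = X_i - \bar{X}$, $\sum_{i=1}^n x_i^2 = \sum_{i=1}^n X_i^2 - n\bar{X}^2$, $\sum_{i=1}^n y_i^2 = \sum_{i=1}^n Y_i^2 - n\bar{Y}^2$ and $\sum_{i=1}^n x_i y_i = \sum_{i=1}^n X_i Y_i - n\bar{X}\bar{Y}$.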

Figure 3.1 'True' Consumption Function

Figure 3.2 Estimated Consumption Function

These estimators are subscripted by OLS denoting the ordinary least squares estimators. The OLS residuals $e_i = Y_i - \hat{\alpha}_{OLS} - \hat{\beta}_{OLS}X_i$ automatically satisfy the two numerical relationships given by (3.2) and (3.3). The first relationship states that (i) $\sum_{i=1}^n e_i = 0$, the residuals sum to zero. This is true as long as there is a constant in the regression. This numerical property of the least squares residuals also implies that the estimated regression line passes through the sample means $(\bar{X}, \bar{Y})$. To see this, average the residuals, or equation (3.2); this gives immediately $\bar{Y} = \hat{\alpha}_{OLS} + \hat{\beta}_{OLS}\bar{X}$. The second relationship states that (ii) $\sum_{i=1}^n e_i X_i = 0$, the residuals and the explanatory variable are uncorrelated. Other numerical properties that the OLS estimators satisfy are the following: (iii) $\sum_{i=1}^n \hat{Y}_i = \sum_{i=1}^n Y_i$ and (iv) $\sum_{i=1}^n e_i \hat{Y}_i = 0$. Property (iii) states that the sum of the estimated $Y_i$'s or the predicted $\hat{Y}_i$'s from the sample is equal to the sum of the actual $Y_i$'s. Property (iv) states that the OLS residuals and the predicted $\hat{Y}_i$'s are uncorrelated. The proofs of (iii) and (iv) follow from (i) and (ii), see problem 1. Of course, underlying our estimation of (3.1) is the assumption that (3.1) is the true model generating the data. In this case, (3.1) is linear in the parameters $\alpha$ and $\beta$, and contains only one explanatory variable $X_i$ besides the constant. The inclusion of other explanatory variables in the model will be considered in Chapter 4, and the relaxation of the linearity assumption will be considered in Chapters 8 and 13. In order to study the statistical properties of the OLS estimators of $\alpha$ and $\beta$, we need to impose some statistical assumptions on the model generating the data.
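The formulas in (3.4) and the numerical properties (i) to (iv) are easy to verify on a small data set. The sketch below computes the OLS estimates from deviations about the sample means and then evaluates each of the four sums. The five observations are hypothetical and the function name `ols` is an illustrative choice.

```python
def ols(xs, ys):
    """Return (alpha_hat, beta_hat) using equation (3.4)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # sum of x_i * y_i
    sum_xx = sum((x - x_bar) ** 2 for x in xs)                        # sum of x_i squared
    beta_hat = sum_xy / sum_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


xs = [10.0, 20.0, 30.0, 40.0, 50.0]     # hypothetical incomes
ys = [14.0, 22.0, 25.0, 33.0, 36.0]     # hypothetical consumption

alpha_hat, beta_hat = ols(xs, ys)
fitted = [alpha_hat + beta_hat * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

checks = {
    "(i) sum of e_i": sum(residuals),
    "(ii) sum of e_i * X_i": sum(e * x for e, x in zip(residuals, xs)),
    "(iii) sum of Yhat_i minus sum of Y_i": sum(fitted) - sum(ys),
    "(iv) sum of e_i * Yhat_i": sum(e * f for e, f in zip(residuals, fitted)),
}

print("alpha_hat =", alpha_hat, "beta_hat =", beta_hat)
for name, value in checks.items():
    print(name, "=", round(value, 10))
```

Each of the four quantities is zero up to floating-point rounding, as the first-order conditions require.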

Assumption 1: The disturbances have zero mean, i.e., $E(u_i) = 0$ for every $i = 1, 2, \ldots, n$. This assumption is needed to ensure that on the average we are on the true line.

To see what happens if $E(u_i) \neq 0$, consider the case where households consistently under-report their consumption by a constant amount of $\delta$ dollars, while their income is measured accurately, say by cross-referencing it with their IRS tax forms. In this case,

(Observed Consumption) = (True Consumption) $- \delta$

and our regression equation is really

(True Consumption)$_i$ = $\alpha + \beta$(Income)$_i$ + $u_i$

But we observe,

(Observed Consumption)$_i$ = $\alpha + \beta$(Income)$_i$ + $u_i - \delta$

This can be thought of as the old regression equation with a new disturbance term $u_i^* = u_i - \delta$. Using the fact that $\delta > 0$ and $E(u_i) = 0$, one gets $E(u_i^*) = -\delta < 0$. This says that for all households with the same income, say \$20,000, their observed consumption will be on the average below that predicted from the true line $[\alpha + \beta(\$20,000)]$ by an amount $\delta$. Fortunately, one can deal with this problem of constant but non-zero mean of the disturbances by reparametrizing the model as

(Observed Consumption)$_i$ = $\alpha^* + \beta$(Income)$_i$ + $u_i$

where $\alpha^* = \alpha - \delta$. In this case, $E(u_i) = 0$ and $\alpha^*$ and $\beta$ can be estimated from the regression. Note that while $\alpha^*$ is estimable, $\alpha$ and $\delta$ are non-estimable. Also note that for all \$20,000 income households, their average consumption is $[(\alpha - \delta) + \beta(\$20,000)]$.
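The reparametrization can be seen directly in the least squares arithmetic: subtracting a constant $\delta$ from every $Y_i$ lowers the estimated intercept by exactly $\delta$ and leaves the estimated slope unchanged, so the data only ever reveal $\alpha^* = \alpha - \delta$. The sketch below illustrates this with simulated data. The values $\alpha = 10$, $\beta = 0.5$, $\delta = 2$, the normal disturbances and the seed are all assumptions of the example.

```python
import random


def ols(xs, ys):
    """Return (alpha_hat, beta_hat) from the simple least squares formulas."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - beta_hat * x_bar, beta_hat


rng = random.Random(7)                          # arbitrary seed
alpha, beta, delta = 10.0, 0.5, 2.0             # hypothetical parameter values
incomes = [10.0 + 5.0 * i for i in range(20)]
true_consumption = [alpha + beta * x + rng.gauss(0.0, 3.0) for x in incomes]
observed_consumption = [c - delta for c in true_consumption]   # under-reported by delta

a_true, b_true = ols(incomes, true_consumption)
a_obs, b_obs = ols(incomes, observed_consumption)

print("intercepts:", round(a_true, 4), round(a_obs, 4))        # differ by delta
print("slopes:    ", round(b_true, 4), round(b_obs, 4))        # identical
```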

Assumption 2: The disturbances have a constant variance, i.e., $\text{var}(u_i) = \sigma^2$ for every $i = 1, 2, \ldots, n$. This ensures that every observation is equally reliable.

To see what this assumption means, consider the case where $\text{var}(u_i) = \sigma_i^2$, for $i = 1, 2, \ldots, n$. In this case, each observation has a different variance. An observation with a large variance is less reliable than one with a smaller variance. But, how can this differing variance happen? In the case of consumption, households with large disposable income (a large $X_i$, say \$100,000) may be able to save more (or borrow more to spend more) than households with smaller income (a small $X_i$, say \$10,000). In this case, the variation in consumption for the \$100,000 income household will be much larger than that for the \$10,000 income household. Therefore, the corresponding variance for the \$100,000 observation will be larger than that for the \$10,000 observation. Consequences of different variances for different observations will be studied more rigorously in Chapter 5.

Assumption 3: The disturbances are not correlated, i.e., $E(u_i u_j) = 0$ for $i \neq j$, $i, j = 1, 2, \ldots, n$. Knowing the $i$-th disturbance does not tell us anything about the $j$-th disturbance, for $i \neq j$.

For the consumption example, the unforeseen disturbance which caused the $i$-th household to consume more (like a visit of a relative) has nothing to do with the unforeseen disturbances of any other household. This is likely to hold for a random sample of households. However, it is less likely to hold for a time series study of consumption for the aggregate economy, where a disturbance in 1945, a war year, is likely to affect consumption for several years after that. In this case, we say that the disturbance in 1945 is related to the disturbances in 1946, 1947, and so on. Consequences of correlated disturbances will be studied in Chapter 5.

Assumption 4: The explanatory variable $X$ is nonstochastic, i.e., fixed in repeated samples, and hence, not correlated with the disturbances. Also, $\sum_{i=1}^n x_i^2/n \neq 0$ and has a finite limit as $n$ tends to infinity.

This assumption states that we have at least two distinct values for $X$. This makes sense, since we need at least two distinct points to draw a straight line. Otherwise $X_i = \bar{X}$, the common value, and $x_i = X_i - \bar{X} = 0$, which violates $\sum_{i=1}^n x_i^2 \neq 0$. In practice, one always has several distinct values of $X$. More importantly, this assumption implies that $X$ is not a random variable and hence is not correlated with the disturbances.

In section 5.3, we will relax the assumption of a non-stochastic $X$. Basically, $X$ becomes a random variable and our assumptions have to be recast conditional on the set of $X$'s that are observed. This is the more realistic case with economic data. The zero mean assumption becomes $E(u_i/X) = 0$, the constant variance assumption becomes $\text{var}(u_i/X) = \sigma^2$, the no serial correlation assumption becomes $E(u_i u_j/X) = 0$ for $i \neq j$. The conditional expectation here is with respect to every observation on $X_i$ from $i = 1, 2, \ldots, n$. Of course, one can show that if $E(u_i/X) = 0$ for all $i$, then $X_i$ and $u_i$ are not correlated. The reverse is not necessarily true, see problem 3 of Chapter 2. That problem shows that two random variables, say $u_i$ and $X_i$, could be uncorrelated, i.e., not linearly related, when in fact they are nonlinearly related with $u_i = X_i^2$. Hence, $E(u_i/X_i) = 0$ is a stronger assumption than $u_i$ and $X_i$ are not correlated. By the law of iterated expectations given in the Appendix of Chapter 2, $E(u_i/X) = 0$ implies that $E(u_i) = 0$. It also implies that $u_i$ is uncorrelated with any function of $X_i$. This is a stronger assumption than $u_i$ is uncorrelated with $X_i$. Therefore, conditional on $X_i$, the mean of the disturbances is zero and does not depend on $X_i$. In this case, $E(Y_i/X_i) = \alpha + \beta X_i$ is linear in $\alpha$ and $\beta$ and is assumed to be the true conditional mean of $Y$ given $X$.

Figure 3.3 Consumption Function with Cov($X$, $u$) > 0

To see what a violation of assumption 4 means, suppose that $X$ is a random variable and that $X$ and $u$ are positively correlated. Then in the consumption example, households with income above the average income will be associated with disturbances above their mean of zero, and hence positive disturbances. Similarly, households with income below the average income will be associated with disturbances below their mean of zero, and hence negative disturbances. This means that the disturbances are systematically affected by values of the explanatory variable and the scatter of the data will look like Figure 3.3. Note that if we now erase the true line $(\alpha + \beta X)$, and estimate this line from the data, the least squares line drawn through the data is going to have a smaller intercept and a larger slope than those of the true line. The scatter should look like Figure 3.4 where the disturbances are random variables, not correlated with the $X_i$'s, drawn from a distribution with zero mean and constant variance. Assumptions 1 and 4 ensure that $E(Y_i/X_i) = \alpha + \beta X_i$, i.e., on the average we are on the true line. Several economic models will be studied where $X$ and $u$ are correlated. The consequences of this correlation will be studied in Chapters 5 and 11.
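A simulation can illustrate the smaller intercept and larger slope. The sketch below makes one specific assumption about how the correlation arises, namely $u_i = \gamma(X_i - 50) + \varepsilon_i$ with $\gamma > 0$ and $\varepsilon_i$ independent of $X_i$, so that $E(u_i) = 0$ but $\text{cov}(X_i, u_i) > 0$. The true line, $\gamma$, the income distribution, the sample size and the seed are all hypothetical.

```python
import random


def ols(xs, ys):
    """Return (alpha_hat, beta_hat) from the simple least squares formulas."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - beta_hat * x_bar, beta_hat


rng = random.Random(11)                         # arbitrary seed
alpha, beta, gamma = 10.0, 0.5, 0.2             # hypothetical true line and dependence
incomes = [rng.gauss(50.0, 20.0) for _ in range(5000)]           # X is now random
disturbances = [gamma * (x - 50.0) + rng.gauss(0.0, 3.0) for x in incomes]
consumption = [alpha + beta * x + u for x, u in zip(incomes, disturbances)]

alpha_hat, beta_hat = ols(incomes, consumption)
print("true line:      intercept 10.0, slope 0.5")
print(f"estimated line: intercept {alpha_hat:.3f}, slope {beta_hat:.3f}")
```

Under this particular design the estimated slope settles near $\beta + \gamma$ instead of $\beta$, and the intercept falls correspondingly.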

We  now  generate  a  data  set  which  satisfies  all  four  classical  assumptions.  Let  a  and  (3  take 
the  arbitrary  values,  say  10  and  0.5  respectively,  and  consider  a  set  of  20  fixed  X’s,  say  income 
classes  from  $10  to  $105  (in  thousands  of  dollars),  in  increments  of  $5,  i.e.,  $10,  $15,  $20, 
$25, . . .  ,$105.  Our  consumption  variable  Y)  is  constructed  as  (10  +  0.5Xj  +  m)  where  Ui  is  a 
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Figure  3.4  Random  Disturbances  Around  the  Regression 


disturbance  which  is  a  random  draw  from  a  distribution  with  zero  mean  and  constant  variance, 
say  a2  =  9.  Computers  generate  random  numbers  with  various  distributions. 

In this case, Figure 3.4 would depict our data, with the true line being \((10 + 0.5X)\) and \(u_i\)
being random draws from the computer which are by construction independent and identically
distributed with mean zero and variance 9. For every set of 20 \(u_i\)'s randomly generated, given
the fixed \(X_i\)'s, we obtain a corresponding set of 20 \(Y_i\)'s from our linear regression model. This is
what we mean in assumption 4 when we say that the \(X\)'s are fixed in repeated samples. Monte
Carlo experiments generate a large number of samples, say 1000, in the fashion described
above. For each data set generated, least squares can be performed and the properties of the
resulting estimators, which are derived analytically in the remainder of this chapter, can be
verified. For example, the average of the 1000 estimates of \(\alpha\) and \(\beta\) can be compared to their
true values to see whether these least squares estimates are unbiased. Note what will happen to
Figure 3.4 if \(E(u_i) = -\delta\) where \(\delta > 0\), or \(\mathrm{var}(u_i) = \sigma_i^2\) for \(i = 1, 2, \ldots, n\). In the first case, the
mean of \(f(u)\), the probability density function of \(u\), will shift off the true line \((10 + 0.5X)\) by \(-\delta\).
In other words, we can think of the distributions of the \(u_i\)'s, shown in Figure 3.4, being centered
on a new imaginary line parallel to the true line but lower by a distance \(\delta\). This means that one
is more likely to draw negative disturbances than positive disturbances, and the observed \(Y_i\)'s
are more likely to be below the true line than above it. In the second case, each \(f(u_i)\) will have
a different variance, hence the spread of this probability density function will vary with each
observation. In this case, Figure 3.4 will have a distribution for the \(u_i\)'s which has a different
spread for each observation. In other words, if the \(u_i\)'s are say normally distributed, then \(u_1\) is
drawn from a \(N(0, \sigma_1^2)\) distribution, whereas \(u_2\) is drawn from a \(N(0, \sigma_2^2)\) distribution, and so
on. Violation of the classical assumptions can also be studied using Monte Carlo experiments,
see Chapter 5.
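The Monte Carlo experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text fixes the mean and variance of the disturbances but not their distribution, so normal draws are assumed here, and the function names `ols` and `monte_carlo` and the seed are hypothetical choices.

```python
import random


def ols(x, y):
    """Return the least squares intercept and slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


def monte_carlo(reps=1000, alpha=10.0, beta=0.5, sigma=3.0, seed=12345):
    """Average the OLS estimates over `reps` samples, holding the X's fixed."""
    rng = random.Random(seed)
    x = [10 + 5 * i for i in range(20)]  # fixed X's: 10, 15, ..., 105
    sum_alpha = 0.0
    sum_beta = 0.0
    for _ in range(reps):
        # Assumption (not in the text): normal disturbances, mean 0, variance 9.
        y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in x]
        a_hat, b_hat = ols(x, y)
        sum_alpha += a_hat
        sum_beta += b_hat
    return sum_alpha / reps, sum_beta / reps


if __name__ == "__main__":
    mean_alpha, mean_beta = monte_carlo()
    print(f"average alpha-hat = {mean_alpha:.4f} (true value 10)")
    print(f"average beta-hat  = {mean_beta:.4f} (true value 0.5)")
```

With the X's held fixed across replications, the two averages should settle close to 10 and 0.5, which is the sense in which the estimates are unbiased.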
(i) Unbiasedness


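Given assumptions 1-4, it is easy to show that \(\hat{\beta}_{OLS}\) is unbiased for \(\beta\). In fact, using equation
(3.4) one can write

\[ \hat{\beta}_{OLS} = \sum_{i=1}^n x_i y_i \Big/ \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i Y_i \Big/ \sum_{i=1}^n x_i^2 = \beta + \sum_{i=1}^n x_i u_i \Big/ \sum_{i=1}^n x_i^2 \qquad (3.5) \]

where the second equality follows from the fact that \(y_i = Y_i - \bar{Y}\) and \(\sum_{i=1}^n x_i \bar{Y} = \bar{Y} \sum_{i=1}^n x_i = 0\).
The third equality follows from substituting \(Y_i\) from (3.1) and using the fact that \(\sum_{i=1}^n x_i = 0\),
so that \(\sum_{i=1}^n x_i X_i = \sum_{i=1}^n x_i^2\).
Taking expectations of both sides of (3.5) and using assumptions 1 and 4, one can show that
\(E(\hat{\beta}_{OLS}) = \beta\). Furthermore, one can derive the variance of \(\hat{\beta}_{OLS}\) from (3.5) since

\[ \mathrm{var}(\hat{\beta}_{OLS}) = E(\hat{\beta}_{OLS} - \beta)^2 = E\Big(\sum_{i=1}^n x_i u_i \Big/ \sum_{i=1}^n x_i^2\Big)^2 = \mathrm{var}\Big(\sum_{i=1}^n x_i u_i \Big/ \sum_{i=1}^n x_i^2\Big) = \sigma^2 \Big/ \sum_{i=1}^n x_i^2 \qquad (3.6) \]

where the last equality uses assumptions 2 and 3, i.e., that the \(u_i\)'s are not correlated with each
other and that their variance is constant, see problem 4. Note that the variance of the OLS
estimator of \(\beta\) depends upon \(\sigma^2\), the variance of the disturbances in the true model, and on
the variation in \(X\). The larger the variation in \(X\), the larger is \(\sum_{i=1}^n x_i^2\) and the smaller is the
variance of \(\hat{\beta}_{OLS}\).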
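To make the role of the variation in X concrete, the following sketch evaluates the variance formula (3.6) for two sets of 20 fixed X's. The helper name `var_beta_ols` and the "narrow" design are hypothetical; the "wide" design is the $10 to $105 income grid used earlier, with the disturbance variance set to 9.

```python
def var_beta_ols(x, sigma2):
    """Variance of the OLS slope from (3.6): sigma^2 / sum of squared deviations of X."""
    x_bar = sum(x) / len(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sigma2 / sxx


# Two hypothetical designs with the same number of observations.
narrow = [50 + i for i in range(20)]      # X's one unit apart
wide = [10 + 5 * i for i in range(20)]    # X's five units apart: 10, 15, ..., 105

if __name__ == "__main__":
    print("var(beta-hat), narrow design:", var_beta_ols(narrow, 9.0))
    print("var(beta-hat), wide design:  ", var_beta_ols(wide, 9.0))
```

Spreading the X's five times as far apart multiplies the sum of squared deviations by 25, so the variance of the slope estimator falls by the same factor.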


(ii)  Consistency 

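Next, we show that \(\hat{\beta}_{OLS}\) is consistent for \(\beta\). A sufficient condition for consistency is that \(\hat{\beta}_{OLS}\)
is unbiased and its variance tends to zero as \(n\) tends to infinity. We have already shown \(\hat{\beta}_{OLS}\)
to be unbiased; it remains to show that its variance tends to zero as \(n\) tends to infinity.

\[ \lim_{n \to \infty} \mathrm{var}(\hat{\beta}_{OLS}) = \lim_{n \to \infty} \Big[(\sigma^2/n) \Big/ \Big(\sum_{i=1}^n x_i^2/n\Big)\Big] = 0 \]

where the second equality follows from the fact that \((\sigma^2/n) \to 0\) and \((\sum_{i=1}^n x_i^2/n) \neq 0\) and has
a finite limit, see assumption 4. Hence, plim \(\hat{\beta}_{OLS} = \beta\) and \(\hat{\beta}_{OLS}\) is consistent for \(\beta\). Similarly
one can show that \(\hat{\alpha}_{OLS}\) is unbiased and consistent for \(\alpha\) with variance \(\sigma^2 \sum_{i=1}^n X_i^2 \big/ n \sum_{i=1}^n x_i^2\),
and \(\mathrm{cov}(\hat{\alpha}_{OLS}, \hat{\beta}_{OLS}) = -\bar{X}\sigma^2 \big/ \sum_{i=1}^n x_i^2\), see problem 5.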


(iii)  Best  Linear  Unbiased 

Using (3.5) one can write \(\hat{\beta}_{OLS}\) as \(\sum_{i=1}^n w_i Y_i\) where \(w_i = x_i \big/ \sum_{i=1}^n x_i^2\). This proves that \(\hat{\beta}_{OLS}\)
is a linear combination of the \(Y_i\)'s, with weights \(w_i\) satisfying the following properties:

\[ \sum_{i=1}^n w_i = 0; \qquad \sum_{i=1}^n w_i X_i = 1; \qquad \sum_{i=1}^n w_i^2 = 1 \Big/ \sum_{i=1}^n x_i^2 \qquad (3.7) \]

The next theorem shows that among all linear unbiased estimators of \(\beta\), it is \(\hat{\beta}_{OLS}\) which has
the smallest variance. This is known as the Gauss-Markov Theorem.

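Theorem 1: Consider any arbitrary linear estimator \(\tilde{\beta} = \sum_{i=1}^n a_i Y_i\) for \(\beta\), where the \(a_i\)'s denote
arbitrary constants. If \(\tilde{\beta}\) is unbiased for \(\beta\), and assumptions 1 to 4 are satisfied, then
\(\mathrm{var}(\tilde{\beta}) \geq \mathrm{var}(\hat{\beta}_{OLS})\).

Proof: Substituting \(Y_i\) from (3.1) into \(\tilde{\beta}\), one gets \(\tilde{\beta} = \alpha \sum_{i=1}^n a_i + \beta \sum_{i=1}^n a_i X_i + \sum_{i=1}^n a_i u_i\). For \(\tilde{\beta}\)
to be unbiased for \(\beta\) it must follow that \(E(\tilde{\beta}) = \alpha \sum_{i=1}^n a_i + \beta \sum_{i=1}^n a_i X_i = \beta\) for all observations
\(i = 1, 2, \ldots, n\). This means that \(\sum_{i=1}^n a_i = 0\) and \(\sum_{i=1}^n a_i X_i = 1\) for all \(i = 1, 2, \ldots, n\). Hence,
\(\tilde{\beta} = \beta + \sum_{i=1}^n a_i u_i\) with \(\mathrm{var}(\tilde{\beta}) = \mathrm{var}(\sum_{i=1}^n a_i u_i) = \sigma^2 \sum_{i=1}^n a_i^2\) where the last equality follows
from assumptions 2 and 3. But the \(a_i\)'s are constants which differ from the \(w_i\)'s, the weights of
the OLS estimator, by some other constants, say \(d_i\)'s, i.e., \(a_i = w_i + d_i\) for \(i = 1, 2, \ldots, n\). Using
the properties of the \(a_i\)'s and \(w_i\)'s one can deduce similar properties on the \(d_i\)'s, i.e., \(\sum_{i=1}^n d_i = 0\)
and \(\sum_{i=1}^n d_i X_i = 0\). In fact,

\[ \sum_{i=1}^n a_i^2 = \sum_{i=1}^n d_i^2 + \sum_{i=1}^n w_i^2 + 2 \sum_{i=1}^n w_i d_i \]

where \(\sum_{i=1}^n w_i d_i = \sum_{i=1}^n x_i d_i \big/ \sum_{i=1}^n x_i^2 = 0\). This follows from the definition of \(w_i\) and the fact
that \(\sum_{i=1}^n d_i = \sum_{i=1}^n d_i X_i = 0\). Hence,

\[ \mathrm{var}(\tilde{\beta}) = \sigma^2 \sum_{i=1}^n a_i^2 = \sigma^2 \sum_{i=1}^n d_i^2 + \sigma^2 \sum_{i=1}^n w_i^2 = \mathrm{var}(\hat{\beta}_{OLS}) + \sigma^2 \sum_{i=1}^n d_i^2 \]

Since \(\sigma^2 \sum_{i=1}^n d_i^2\) is non-negative, this proves that \(\mathrm{var}(\tilde{\beta}) \geq \mathrm{var}(\hat{\beta}_{OLS})\) with the equality holding
only if \(d_i = 0\) for all \(i = 1, 2, \ldots, n\), i.e., only if \(a_i = w_i\), in which case \(\tilde{\beta}\) reduces to \(\hat{\beta}_{OLS}\).
Therefore, any linear estimator of \(\beta\), like \(\tilde{\beta}\), that is unbiased for \(\beta\) has variance at least as large
as \(\mathrm{var}(\hat{\beta}_{OLS})\). This proves that \(\hat{\beta}_{OLS}\) is BLUE, Best among all Linear Unbiased Estimators of
\(\beta\).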

Similarly, one can show that \(\hat{\alpha}_{OLS}\) is linear in \(Y_i\) and has the smallest variance among all
linear unbiased estimators of \(\alpha\), if assumptions 1 to 4 are satisfied, see problem 6. This result
implies that the OLS estimator of \(\alpha\) is also BLUE.
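The Gauss-Markov argument can be illustrated numerically by comparing the OLS weights with the weights of one particular competing linear unbiased estimator. The competitor used here, the slope of the line through the first and last observations, \((Y_n - Y_1)/(X_n - X_1)\), is a hypothetical choice made for illustration, as are the function names; the X's are the 20 income classes from $10 to $105 used earlier.

```python
def ols_weights(x):
    """OLS weights w_i = x_i / sum(x_i^2), with x_i in deviations from the mean."""
    x_bar = sum(x) / len(x)
    dev = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dev)
    return [d / sxx for d in dev]


def endpoint_weights(x):
    """Weights a_i of the competing estimator (Y_n - Y_1) / (X_n - X_1)."""
    a = [0.0] * len(x)
    a[0] = -1.0 / (x[-1] - x[0])
    a[-1] = 1.0 / (x[-1] - x[0])
    return a


x = [10 + 5 * i for i in range(20)]
w = ols_weights(x)
a = endpoint_weights(x)
d = [ai - wi for ai, wi in zip(a, w)]  # a_i = w_i + d_i

if __name__ == "__main__":
    # Both sets of weights satisfy the unbiasedness conditions.
    print("sum a_i     =", sum(a), "  sum a_i X_i =", sum(ai * xi for ai, xi in zip(a, x)))
    print("sum w_i     =", sum(w), "  sum w_i X_i =", sum(wi * xi for wi, xi in zip(w, x)))
    # The cross term vanishes, so sum a_i^2 = sum w_i^2 + sum d_i^2.
    print("sum w_i d_i =", sum(wi * di for wi, di in zip(w, d)))
    print("sum w_i^2   =", sum(wi ** 2 for wi in w))
    print("sum a_i^2   =", sum(ai ** 2 for ai in a))
```

Multiplying the two sums of squared weights by the disturbance variance gives the two variances, so the competitor's variance exceeds that of OLS by exactly the disturbance variance times the sum of the squared \(d_i\)'s.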


3.4 Estimation of \(\sigma^2\)


The variance of the regression disturbances \(\sigma^2\) is unknown and has to be estimated. In fact,
both the variance of \(\hat{\beta}_{OLS}\) and that of \(\hat{\alpha}_{OLS}\) depend upon \(\sigma^2\), see (3.6) and problem 5. An
unbiased estimator for \(\sigma^2\) is \(s^2 = \sum_{i=1}^n e_i^2/(n - 2)\). To prove this, we need the fact that

\[ e_i = Y_i - \hat{\alpha}_{OLS} - \hat{\beta}_{OLS} X_i = y_i - \hat{\beta}_{OLS} x_i = (\beta - \hat{\beta}_{OLS}) x_i + (u_i - \bar{u}) \]

where \(\bar{u} = \sum_{i=1}^n u_i/n\). The second equality substitutes \(\hat{\alpha}_{OLS} = \bar{Y} - \hat{\beta}_{OLS} \bar{X}\) and the third
equality substitutes \(y_i = \beta x_i + (u_i - \bar{u})\). Hence,

\[ \sum_{i=1}^n e_i^2 = (\hat{\beta}_{OLS} - \beta)^2 \sum_{i=1}^n x_i^2 + \sum_{i=1}^n (u_i - \bar{u})^2 - 2(\hat{\beta}_{OLS} - \beta) \sum_{i=1}^n x_i (u_i - \bar{u}), \]

and

\[ E\Big(\sum_{i=1}^n e_i^2\Big) = \sum_{i=1}^n x_i^2 \, \mathrm{var}(\hat{\beta}_{OLS}) + (n - 1)\sigma^2 - 2E\Big(\sum_{i=1}^n x_i u_i\Big)^2 \Big/ \sum_{i=1}^n x_i^2 = \sigma^2 + (n - 1)\sigma^2 - 2\sigma^2 = (n - 2)\sigma^2 \]

where the first equality uses the fact that \(E(\sum_{i=1}^n (u_i - \bar{u})^2) = (n - 1)\sigma^2\) and \(\hat{\beta}_{OLS} - \beta = \sum_{i=1}^n x_i u_i \big/ \sum_{i=1}^n x_i^2\).
The second equality uses the fact that \(\mathrm{var}(\hat{\beta}_{OLS}) = \sigma^2 \big/ \sum_{i=1}^n x_i^2\) and

\[ E\Big(\sum_{i=1}^n x_i u_i\Big)^2 = \sigma^2 \sum_{i=1}^n x_i^2 \]

Therefore, \(E(s^2) = E(\sum_{i=1}^n e_i^2/(n - 2)) = \sigma^2\).

Intuitively, the estimator of \(\sigma^2\) could be obtained from \(\sum_{i=1}^n (u_i - \bar{u})^2/(n - 1)\) if the true
disturbances were known. Since the \(u\)'s are not known, consistent estimates of them are used.
These are the \(e_i\)'s. Since \(\sum_{i=1}^n e_i = 0\), our estimator of \(\sigma^2\) becomes \(\sum_{i=1}^n e_i^2/(n - 1)\). Taking
expectations we find that the correct divisor ought to be \((n - 2)\) and not \((n - 1)\) for this estimator
to be unbiased for \(\sigma^2\). This is plausible, since we have estimated two parameters \(\alpha\) and \(\beta\) in
obtaining the \(e_i\)'s, and there are only \(n - 2\) independent pieces of information left in the data.
To see this, consider the OLS normal equations given in (3.2) and (3.3). These equations
represent two relationships involving the \(e_i\)'s. Therefore, knowing \((n - 2)\) of the \(e_i\)'s we can
deduce the remaining two \(e_i\)'s from (3.2) and (3.3).
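The choice of divisor can also be checked by simulation. The sketch below is an illustration under assumptions rather than a proof: it reuses the design of 20 fixed X's with a true line of 10 + 0.5X and disturbance variance 9, assumes normal disturbances, and averages the residual sum of squares over many samples. The function names and the seed are hypothetical.

```python
import random


def residual_sum_of_squares(x, y):
    """Residual sum of squares from the least squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))


def compare_divisors(reps=2000, alpha=10.0, beta=0.5, sigma=3.0, seed=2024):
    """Average RSS/(n-2) and RSS/(n-1) over `reps` simulated samples."""
    rng = random.Random(seed)
    x = [10 + 5 * i for i in range(20)]
    n = len(x)
    total_rss = 0.0
    for _ in range(reps):
        # Assumption: normal disturbances with mean 0 and variance sigma^2.
        y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in x]
        total_rss += residual_sum_of_squares(x, y)
    mean_rss = total_rss / reps
    return mean_rss / (n - 2), mean_rss / (n - 1)


if __name__ == "__main__":
    with_n_minus_2, with_n_minus_1 = compare_divisors()
    print("average of RSS/(n-2):", with_n_minus_2, "(true sigma^2 = 9)")
    print("average of RSS/(n-1):", with_n_minus_1)
```

The average using the divisor n - 2 should sit near 9, while the divisor n - 1 gives a systematically smaller value.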


3.5  Maximum  Likelihood  Estimation 


Assumption 5: The \(u_i\)'s are independent and identically distributed \(N(0, \sigma^2)\).

This assumption allows us to derive distributions of estimators and other test statistics. In
fact using (3.5) one can easily see that \(\hat{\beta}_{OLS}\) is a linear combination of the \(u_i\)'s. But, a linear
combination of normal random variables is itself a normal random variable, see Chapter 2,
problem 15. Hence, \(\hat{\beta}_{OLS}\) is \(N(\beta, \sigma^2 \big/ \sum_{i=1}^n x_i^2)\). Similarly \(\hat{\alpha}_{OLS}\) is \(N(\alpha, \sigma^2 \sum_{i=1}^n X_i^2 \big/ n \sum_{i=1}^n x_i^2)\),
and \(Y_i\) is \(N(\alpha + \beta X_i, \sigma^2)\). Moreover, we can write the joint probability density function of the \(u\)'s
as \(f(u_1, u_2, \ldots, u_n; \alpha, \beta, \sigma^2) = (1/2\pi\sigma^2)^{n/2} \exp(-\sum_{i=1}^n u_i^2/2\sigma^2)\). To get the likelihood function
we make the transformation \(u_i = Y_i - \alpha - \beta X_i\) and note that the Jacobian of the transformation
is 1. Therefore,

\[ f(Y_1, Y_2, \ldots, Y_n; \alpha, \beta, \sigma^2) = (1/2\pi\sigma^2)^{n/2} \exp\Big\{-\sum_{i=1}^n (Y_i - \alpha - \beta X_i)^2 / 2\sigma^2\Big\} \qquad (3.8) \]

Taking the log of this likelihood, we get

\[ \log L(\alpha, \beta, \sigma^2) = -(n/2) \log(2\pi\sigma^2) - \sum_{i=1}^n (Y_i - \alpha - \beta X_i)^2 / 2\sigma^2 \qquad (3.9) \]

Maximizing this likelihood with respect to \(\alpha\), \(\beta\) and \(\sigma^2\) one gets the maximum likelihood
estimators (MLE). However, only the second term in the log likelihood contains \(\alpha\) and \(\beta\) and that
term (without the negative sign) has already been minimized with respect to \(\alpha\) and \(\beta\) in (3.2)
and (3.3) giving us the OLS estimators. Hence, \(\hat{\alpha}_{MLE} = \hat{\alpha}_{OLS}\) and \(\hat{\beta}_{MLE} = \hat{\beta}_{OLS}\). Similarly,
by differentiating \(\log L\) with respect to \(\sigma^2\) and setting this derivative equal to zero one gets
\(\hat{\sigma}^2_{MLE} = \sum_{i=1}^n e_i^2/n\), see problem 7. Note that this differs from \(s^2\) only in the divisor. In fact,
\(E(\hat{\sigma}^2_{MLE}) = (n - 2)\sigma^2/n \neq \sigma^2\). Hence, \(\hat{\sigma}^2_{MLE}\) is biased but note that it is still asymptotically
unbiased.
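The differentiation step for \(\sigma^2\) is left to problem 7; a sketch of how it might go, treating \(\sigma^2\) as the variable of differentiation in (3.9) and evaluating the sum of squares at the OLS estimates, is:

```latex
\[
\frac{\partial \log L}{\partial \sigma^2}
  = -\frac{n}{2\sigma^2}
    + \frac{\sum_{i=1}^n (Y_i - \alpha - \beta X_i)^2}{2\sigma^4} = 0
\quad\Longrightarrow\quad
\hat{\sigma}^2_{MLE}
  = \frac{\sum_{i=1}^n (Y_i - \hat{\alpha}_{OLS} - \hat{\beta}_{OLS} X_i)^2}{n}
  = \frac{\sum_{i=1}^n e_i^2}{n}
\]
\[
E(\hat{\sigma}^2_{MLE})
  = \frac{E\left(\sum_{i=1}^n e_i^2\right)}{n}
  = \frac{(n-2)\sigma^2}{n}
  \longrightarrow \sigma^2 \quad \text{as } n \to \infty
\]
```

The second line uses the result \(E(\sum e_i^2) = (n - 2)\sigma^2\) from section 3.4, which is why the bias disappears as the sample size grows.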

So far, the gains from imposing assumption 5 are the following: The likelihood can be formed,
maximum likelihood estimators can be derived, and distributions can be obtained for these
estimators. One can also derive the Cramer-Rao lower bound for unbiased estimators of the
parameters and show that \(\hat{\alpha}_{OLS}\) and \(\hat{\beta}_{OLS}\) attain this bound whereas \(s^2\) does not. This
derivation is postponed until Chapter 7. In fact, one can show following the theory of complete
sufficient statistics that \(\hat{\alpha}_{OLS}\), \(\hat{\beta}_{OLS}\) and \(s^2\) are minimum variance unbiased estimators for \(\alpha\), \(\beta\)
and \(\sigma^2\), see Chapter 2. This is a stronger result (for \(\hat{\alpha}_{OLS}\) and \(\hat{\beta}_{OLS}\)) than that obtained using
the Gauss-Markov Theorem. It says that among all unbiased estimators of \(\alpha\) and \(\beta\), the OLS
estimators are the best. In other words, our set of estimators now includes all unbiased estimators
and not just linear unbiased estimators. This stronger result is obtained at the expense of a
stronger distributional assumption, i.e., normality. If the distribution of the disturbances is not
normal, then OLS is no longer MLE. In this case, MLE will be more efficient than OLS as
long as the distribution of the disturbances is correctly specified. Some of the advantages and
disadvantages of MLE were discussed in Chapter 2.

We found the distributions of \(\hat{\alpha}_{OLS}\) and \(\hat{\beta}_{OLS}\); now we give that of \(s^2\). In Chapter 7, it is shown
that \(\sum_{i=1}^n e_i^2/\sigma^2\) is a chi-squared with \((n - 2)\) degrees of freedom. Also, \(s^2\) is independent of
\(\hat{\alpha}_{OLS}\) and \(\hat{\beta}_{OLS}\). This is useful for tests of hypotheses. In fact, the major gain from assumption
5 is that we can perform tests of hypotheses.

Standardizing the normal random variable \(\hat{\beta}_{OLS}\), one gets \(z = (\hat{\beta}_{OLS} - \beta) \big/ (\sigma^2 / \sum_{i=1}^n x_i^2)^{1/2} \sim N(0, 1)\).
Also, \((n - 2)s^2/\sigma^2\) is distributed as \(\chi^2_{n-2}\). Hence, one can divide \(z\), a \(N(0, 1)\) random
variable, by the square root of \((n - 2)s^2/\sigma^2\) divided by its degrees of freedom \((n - 2)\) to
get a t-statistic with \((n - 2)\) degrees of freedom. The resulting statistic is
\(t_{obs} = (\hat{\beta}_{OLS} - \beta) \big/ (s^2 / \sum_{i=1}^n x_i^2)^{1/2} \sim t_{n-2}\), see problem 8. This statistic can be used to test \(H_0\): \(\beta = \beta_0\), versus
\(H_1\): \(\beta \neq \beta_0\), where \(\beta_0\) is a known constant. Under \(H_0\), \(t_{obs}\) can be calculated and its value can be
compared to a critical value from a t-distribution with \((n - 2)\) degrees of freedom, at a specified
significance level of \(\alpha\)%. Of specific interest is the hypothesis \(H_0\): \(\beta = 0\), which states that there is
no linear relationship between \(Y_i\) and \(X_i\). Under \(H_0\),

\[ t_{obs} = \hat{\beta}_{OLS} \Big/ \Big(s^2 \Big/ \sum_{i=1}^n x_i^2\Big)^{1/2} = \hat{\beta}_{OLS} / se(\hat{\beta}_{OLS}) \]

where \(se(\hat{\beta}_{OLS}) = (s^2 / \sum_{i=1}^n x_i^2)^{1/2}\). If \(|t_{obs}| > t_{\alpha/2;n-2}\), then \(H_0\) is rejected at the \(\alpha\)% significance
level. \(t_{\alpha/2;n-2}\) represents a critical value obtained from a t-distribution with \(n - 2\) degrees of
freedom. It is determined such that the area to its right under a \(t_{n-2}\) distribution is equal to
\(\alpha/2\).

Similarly one can get a confidence interval for \(\beta\) by using the fact that
\(\Pr[-t_{\alpha/2;n-2} < t_{obs} < t_{\alpha/2;n-2}] = 1 - \alpha\) and substituting for \(t_{obs}\) its value derived above as \((\hat{\beta}_{OLS} - \beta)/se(\hat{\beta}_{OLS})\).
Since the critical values are known, and \(\hat{\beta}_{OLS}\) and \(se(\hat{\beta}_{OLS})\) can be calculated from the data, the
following \(100(1 - \alpha)\)% confidence interval for \(\beta\) emerges

\[ \hat{\beta}_{OLS} \pm t_{\alpha/2;n-2} \, se(\hat{\beta}_{OLS}). \]

Tests of hypotheses and confidence intervals on \(\alpha\) and \(\sigma^2\) can be similarly constructed using the
normal distribution of \(\hat{\alpha}_{OLS}\) and the \(\chi^2_{n-2}\) distribution of \((n - 2)s^2/\sigma^2\).
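A minimal sketch of the t-test and confidence interval for the slope follows, applied to the ten household observations of Table 3.1. The function name `slope_inference` is hypothetical, and because the Python standard library has no t-distribution, the critical value \(t_{.025;8} = 2.306\) is entered by hand from a standard t-table.

```python
import math

# Consumption (Y) and income (X) for the ten households of Table 3.1.
Y = [4.6, 3.6, 4.6, 6.6, 7.6, 5.6, 5.6, 8.6, 8.6, 9.6]
X = [5, 4, 6, 8, 8, 7, 6, 9, 10, 12]

# Tabulated critical value t_{.025;8}, typed in by hand (an input, not computed).
T_CRIT_025_8DF = 2.306


def slope_inference(x, y, beta_0, t_crit):
    """t-statistic for H0: beta = beta_0 and the matching confidence interval."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    rss = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)                      # unbiased estimate of sigma^2
    se_beta = math.sqrt(s2 / sxx)           # standard error of the slope
    t_obs = (beta_hat - beta_0) / se_beta
    interval = (beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta)
    return beta_hat, se_beta, t_obs, interval


if __name__ == "__main__":
    beta_hat, se_beta, t_obs, interval = slope_inference(X, Y, 0.0, T_CRIT_025_8DF)
    print("beta-hat:", beta_hat, " se:", se_beta)
    print("t-statistic for H0: beta = 0:", t_obs)
    print("95% confidence interval:", interval)
    print("reject H0 at the 5% level:", abs(t_obs) > T_CRIT_025_8DF)
```

Passing a different `beta_0` tests any other hypothesized slope; the confidence interval does not depend on `beta_0`.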


3.6  A  Measure  of  Fit 

We have obtained the least squares estimates of \(\alpha\), \(\beta\) and \(\sigma^2\) and found their distributions
under normality of the disturbances. We have also learned how to test hypotheses regarding
these parameters. Now we turn to a measure of fit for this estimated regression line. Recall that
\(e_i = Y_i - \hat{Y}_i\) where \(\hat{Y}_i\) denotes the predicted \(Y_i\) from the least squares regression line at the value
\(X_i\), i.e., \(\hat{\alpha}_{OLS} + \hat{\beta}_{OLS} X_i\). Using the fact that \(\sum_{i=1}^n e_i = 0\), we deduce that \(\sum_{i=1}^n Y_i = \sum_{i=1}^n \hat{Y}_i\),
and therefore, \(\bar{Y} = \bar{\hat{Y}}\). The actual and predicted values of \(Y\) have the same sample mean, see
numerical properties (i) and (iii) of the OLS estimators discussed in section 3.2. This is true
as long as there is a constant in the regression. Adding and subtracting \(\bar{Y}\) from \(e_i\), we get
\(e_i = y_i - \hat{y}_i\), or \(y_i = e_i + \hat{y}_i\). Squaring and summing both sides:

\[ \sum_{i=1}^n y_i^2 = \sum_{i=1}^n e_i^2 + \sum_{i=1}^n \hat{y}_i^2 + 2 \sum_{i=1}^n e_i \hat{y}_i = \sum_{i=1}^n e_i^2 + \sum_{i=1}^n \hat{y}_i^2 \qquad (3.10) \]

where the last equality follows from the fact that \(\hat{y}_i = \hat{\beta}_{OLS} x_i\) and \(\sum_{i=1}^n e_i x_i = 0\). In fact,

\[ \sum_{i=1}^n e_i \hat{y}_i = \sum_{i=1}^n e_i \hat{Y}_i = 0 \]

means that the OLS residuals are uncorrelated with the predicted values from the regression,
see numerical properties (ii) and (iv) of the OLS estimates discussed in section 3.2. In other
words, (3.10) says that the total variation in \(Y_i\) around its sample mean \(\bar{Y}\), i.e., \(\sum_{i=1}^n y_i^2\), can be
decomposed into two parts: the first is the regression sums of squares \(\sum_{i=1}^n \hat{y}_i^2 = \hat{\beta}^2_{OLS} \sum_{i=1}^n x_i^2\),
and the second is the residual sums of squares \(\sum_{i=1}^n e_i^2\). In fact, regressing \(Y\) on a constant
yields \(\hat{\alpha}_{OLS} = \bar{Y}\), see problem 2, and the unexplained residual sums of squares of this naive
model is

\[ \sum_{i=1}^n (Y_i - \hat{\alpha}_{OLS})^2 = \sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n y_i^2 \]

Therefore, \(\sum_{i=1}^n \hat{y}_i^2\) in (3.10) gives the explanatory power of \(X\) after the constant is fit.

Using this decomposition, one can define the explanatory power of the regression as the
ratio of the regression sums of squares to the total sums of squares. In other words, define
\(R^2 = \sum_{i=1}^n \hat{y}_i^2 \big/ \sum_{i=1}^n y_i^2\) and this value is clearly between 0 and 1. In fact, dividing (3.10) by
\(\sum_{i=1}^n y_i^2\) one gets \(R^2 = 1 - \sum_{i=1}^n e_i^2 \big/ \sum_{i=1}^n y_i^2\). The \(\sum_{i=1}^n e_i^2\) is a measure of misfit which was
minimized by least squares. If \(\sum_{i=1}^n e_i^2\) is large, this means that the regression is not explaining
a lot of the variation in \(Y\) and hence, the \(R^2\) value would be small. Alternatively, if the \(\sum_{i=1}^n e_i^2\)
is small, then the fit is good and \(R^2\) is large. In fact, for a perfect fit, where all the observations
lie on the fitted line, \(\hat{Y}_i = Y_i\) and \(e_i = 0\), which means that \(\sum_{i=1}^n e_i^2 = 0\) and \(R^2 = 1\). The other
extreme case is where the regression sums of squares \(\sum_{i=1}^n \hat{y}_i^2 = 0\). In other words, the linear
regression explains nothing of the variation in \(Y_i\). In this case, \(\sum_{i=1}^n y_i^2 = \sum_{i=1}^n e_i^2\) and \(R^2 = 0\).
Note that \(\sum_{i=1}^n \hat{y}_i^2 = 0\) implies \(\hat{y}_i = 0\) for every \(i\), which in turn means that \(\hat{Y}_i = \bar{Y}\) for
every \(i\). The fitted regression line is a horizontal line drawn at \(Y = \bar{Y}\), and the independent
variable \(X\) does not have any explanatory power in a linear relationship with \(Y\).

Note that \(R^2\) has two alternative meanings: (i) It is the simple squared correlation coefficient
between \(Y_i\) and \(\hat{Y}_i\), see problem 9. Also, for the simple regression case, (ii) it is the simple
squared correlation between \(X\) and \(Y\). This means that before one runs the regression of \(Y\) on
\(X\), one can compute \(r^2_{xy}\) which in turn tells us the proportion of the variation in \(Y\) that will
be explained by \(X\). If this number is pretty low, we have a weak linear relationship between \(Y\)
and \(X\) and we know that a poor fit will result if \(Y\) is regressed on \(X\). It is worth emphasizing
that \(R^2\) is a measure of linear association between \(Y\) and \(X\). There could exist, for example, a
perfect quadratic relationship between \(X\) and \(Y\), yet the estimated least squares line through
the data is a flat line implying that \(R^2 = 0\), see problem 3 of Chapter 2. One should also be
suspicious of least squares regressions with \(R^2\) that are too close to 1. In some cases, we may
not want to include a constant in the regression. In such cases, one should use an uncentered
\(R^2\) as a measure of fit. The appendix to this chapter defines both centered and uncentered \(R^2\) and
explains the difference between them.
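The two readings of \(R^2\), and the warning that it measures only linear association, can be illustrated with a short sketch. The function names are hypothetical, the first data set is the ten observations of Table 3.1, and the second is a made-up example of an exact quadratic relation on X values placed symmetrically around zero.

```python
def r_squared(x, y):
    """R^2 computed as 1 - (residual sum of squares) / (total sum of squares)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    rss = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - rss / tss


def squared_correlation(x, y):
    """Simple squared correlation coefficient between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# The ten household observations of Table 3.1.
Y = [4.6, 3.6, 4.6, 6.6, 7.6, 5.6, 5.6, 8.6, 8.6, 9.6]
X = [5, 4, 6, 8, 8, 7, 6, 9, 10, 12]

# Hypothetical data: Y is an exact quadratic function of X.
X_QUAD = [-2, -1, 0, 1, 2]
Y_QUAD = [xi ** 2 for xi in X_QUAD]

if __name__ == "__main__":
    print("Table 3.1:  R^2 =", r_squared(X, Y), " r_xy^2 =", squared_correlation(X, Y))
    print("Quadratic:  R^2 =", r_squared(X_QUAD, Y_QUAD))
```

In the quadratic case the fitted line is flat at the sample mean of Y, so the linear regression explains none of the variation even though Y is an exact function of X.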
3.7 Prediction


Let us now predict \(Y_0\) given \(X_0\). Usually this is done for a time series regression, where the
researcher is interested in predicting the future, say one period ahead. This new observation \(Y_0\)
is generated by (3.1), i.e.,

\[ Y_0 = \alpha + \beta X_0 + u_0 \qquad (3.11) \]

What is the Best Linear Unbiased Predictor (BLUP) of \(E(Y_0)\)? From (3.11), \(E(Y_0) = \alpha + \beta X_0\)
is a linear combination of \(\alpha\) and \(\beta\). Using the Gauss-Markov result, \(\hat{Y}_0 = \hat{\alpha}_{OLS} + \hat{\beta}_{OLS} X_0\) is
BLUE for \(\alpha + \beta X_0\) and the variance of this predictor of \(E(Y_0)\) is \(\sigma^2[(1/n) + (X_0 - \bar{X})^2 \big/ \sum_{i=1}^n x_i^2]\),
see problem 10. But, what if we are interested in the BLUP for \(Y_0\) itself? \(Y_0\) differs from \(E(Y_0)\)
by \(u_0\), and the best predictor of \(u_0\) is zero, so the BLUP for \(Y_0\) is still \(\hat{Y}_0\). The forecast error is

\[ Y_0 - \hat{Y}_0 = [Y_0 - E(Y_0)] + [E(Y_0) - \hat{Y}_0] = u_0 + [E(Y_0) - \hat{Y}_0] \]

where \(u_0\) is the error committed even if the true regression line is known, and \(E(Y_0) - \hat{Y}_0\) is
the difference between the sample and population regression lines. Hence, the variance of the
forecast error becomes:

\[ \mathrm{var}(u_0) + \mathrm{var}[E(Y_0) - \hat{Y}_0] + 2\,\mathrm{cov}[u_0, E(Y_0) - \hat{Y}_0] = \sigma^2\Big[1 + (1/n) + (X_0 - \bar{X})^2 \Big/ \sum_{i=1}^n x_i^2\Big] \]

This says that the variance of the forecast error is equal to the variance of the predictor of
\(E(Y_0)\) plus the \(\mathrm{var}(u_0)\) plus twice the covariance of the predictor of \(E(Y_0)\) and \(u_0\). But, this last
covariance is zero, since \(u_0\) is a new disturbance and is not correlated with the disturbances in the
sample upon which \(\hat{Y}_0\) is based. Therefore, the predictor of the average consumption of a $20,000
income household is the same as the predictor of consumption for a specific household whose
income is $20,000. The difference is not in the predictor itself but in the variance attached
to it, the latter variance being larger only by \(\sigma^2\), the variance of \(u_0\). The variance of the
predictor, therefore, depends upon \(\sigma^2\), the sample size, the variation in the \(X\)'s, and how far \(X_0\)
is from the sample mean of the observed data. To summarize, the smaller \(\sigma^2\) is, the larger \(n\)
and \(\sum_{i=1}^n x_i^2\) are, and the closer \(X_0\) is to \(\bar{X}\), the smaller is the variance of the predictor. One
can construct 95% confidence intervals for these predictions for every value of \(X_0\). In fact, this
is \((\hat{\alpha}_{OLS} + \hat{\beta}_{OLS} X_0) \pm t_{.025;n-2}\{s[1 + (1/n) + (X_0 - \bar{X})^2 \big/ \sum_{i=1}^n x_i^2]^{1/2}\}\) where \(s\) replaces \(\sigma\), and
\(t_{.025;n-2}\) represents the 2.5% critical value obtained from a t-distribution with \(n - 2\) degrees of
freedom. Figure 3.5 shows this confidence band around the estimated regression line. This is a
hyperbola which is the narrowest at \(\bar{X}\) as expected, and widens as we predict away from \(\bar{X}\).
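The prediction formulas above can be sketched as follows, again using the ten observations of Table 3.1 (income and consumption in units of $10,000). The function name `predict` is hypothetical, the value \(X_0 = 2\) (a $20,000 income) lies below the sample range and is chosen only to echo the text, and the critical value \(t_{.025;8} = 2.306\) is entered by hand from a t-table.

```python
import math

# The ten household observations of Table 3.1.
Y = [4.6, 3.6, 4.6, 6.6, 7.6, 5.6, 5.6, 8.6, 8.6, 9.6]
X = [5, 4, 6, 8, 8, 7, 6, 9, 10, 12]
T_CRIT_025_8DF = 2.306  # tabulated t_{.025;8}, typed in by hand


def predict(x, y, x0, t_crit):
    """Predictor of Y0 at x0, its two variances, and the forecast interval."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    rss = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)
    y0_hat = alpha_hat + beta_hat * x0
    distance_term = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    var_mean = s2 * distance_term               # estimated variance of the predictor of E(Y0)
    var_forecast = s2 * (1.0 + distance_term)   # estimated variance of the forecast error
    half_width = t_crit * math.sqrt(var_forecast)
    return y0_hat, var_mean, var_forecast, (y0_hat - half_width, y0_hat + half_width)


if __name__ == "__main__":
    for x0 in (2.0, 7.5, 12.0):
        y0_hat, var_mean, var_forecast, band = predict(X, Y, x0, T_CRIT_025_8DF)
        print(f"X0 = {x0}: prediction {y0_hat:.4f}, interval ({band[0]:.4f}, {band[1]:.4f})")
```

The interval is narrowest at the sample mean of X (7.5 here) and widens as the prediction point moves away from it, which is the hyperbolic band described above.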


3.8  Residual  Analysis 

A plot of the residuals of the regression is very important. The residuals are consistent estimates
of the true disturbances. But unlike the \(u_i\)'s, these \(e_i\)'s are not independent. In fact, the OLS
normal equations (3.2) and (3.3) give us two relationships between these residuals. Therefore,
knowing \((n - 2)\) of these residuals the remaining two residuals can be deduced. If we had the
true \(u_i\)'s, and we plotted them, they should look like a random scatter around the horizontal
axis with no specific pattern to them. A plot of the \(e_i\)'s that shows a certain pattern like a set
of positive residuals followed by a set of negative residuals as shown in Figure 3.6(a) may be
indicative of a violation of one of the 5 assumptions imposed on the model, or simply indicating
a wrong functional form. For example, if assumption 3 is violated, so that the \(u_i\)'s are say
positively correlated, then it is likely to have a positive residual followed by a positive one, and
a negative residual followed by a negative one, as observed in Figure 3.6(b). Alternatively, if
we fit a linear regression line to a true quadratic relation between \(Y\) and \(X\), then a scatter
of residuals like that in Figure 3.6(c) will be generated. We will study how to deal with this
violation and how to test for it in Chapter 5.

Figure 3.5 95% Confidence Bands

Large residuals are indicative of bad predictions in the sample. A large residual could be
a typo, where the researcher entered this observation wrongly. Alternatively, it could be an
influential observation, or an outlier which behaves differently from the other data points in
the sample and therefore, is further away from the estimated regression line than the other
data points. The fact that OLS minimizes the sum of squares of these residuals means that a
large weight is put on this observation and hence it is influential. In other words, removing this
observation from the sample may change the estimates and the regression line significantly. For
more on the study of influential observations, see Belsley, Kuh and Welsch (1980). We will focus
on this issue in Chapter 8 of this book.


Figure 3.6 Positively Correlated Residuals

Figure 3.7 Residual Variation Growing with X


One can also plot the residuals versus the \(X_i\)'s. If a pattern like Figure 3.7 emerges, this could be
indicative of a violation of assumption 2 because the variation of the residuals is growing with
\(X_i\) when it should be constant for all observations. Alternatively, it could imply a relationship
between the \(X_i\)'s and the true disturbances which is a violation of assumption 4.

In summary, one should always plot the residuals to check the data, identify influential
observations, and check violations of the 5 assumptions underlying the regression model. In the next
few chapters, we will study various tests of the violation of the classical assumptions. Most of
these tests are based on the residuals of the model. These tests along with residual plots should
help the researcher gauge the adequacy of his or her model.
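The remark in this section that the residuals are not independent, because the two normal equations tie them together, can be checked numerically. The sketch below uses the ten observations of Table 3.1 and recovers the last two residuals from the first eight; the function names are hypothetical, and the recovery step assumes the last two X values differ, as they do here.

```python
def ols_residuals(x, y):
    """Residuals e_i = Y_i - alpha_hat - beta_hat * X_i from least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]


def recover_last_two(first_residuals, x):
    """Solve sum(e_i) = 0 and sum(e_i * X_i) = 0 for the last two residuals.

    `first_residuals` holds e_1, ..., e_{n-2}; `x` holds all n values of X.
    Requires x[-2] != x[-1].
    """
    s1 = sum(first_residuals)
    s2 = sum(e * xi for e, xi in zip(first_residuals, x))  # zip stops after n-2 terms
    x_a, x_b = x[-2], x[-1]
    e_b = (x_a * s1 - s2) / (x_b - x_a)
    e_a = -s1 - e_b
    return e_a, e_b


# The ten household observations of Table 3.1.
Y = [4.6, 3.6, 4.6, 6.6, 7.6, 5.6, 5.6, 8.6, 8.6, 9.6]
X = [5, 4, 6, 8, 8, 7, 6, 9, 10, 12]

if __name__ == "__main__":
    e = ols_residuals(X, Y)
    print("sum of residuals:     ", sum(e))
    print("sum of residuals * X: ", sum(ei * xi for ei, xi in zip(e, X)))
    print("recovered e_9, e_10:  ", recover_last_two(e[:-2], X))
    print("actual e_9, e_10:     ", (e[-2], e[-1]))
```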


Table 3.1 Simple Regression Computations

OBS   Consumption  Income   y_i = Y_i - Ȳ   x_i = X_i - X̄   x_i y_i   x_i^2    Ŷ_i         e_i
      Y_i          X_i
1     4.6          5        -1.9            -2.5            4.75      6.25     4.476190    0.123810
2     3.6          4        -2.9            -3.5            10.15     12.25    3.666667    -0.066667
3     4.6          6        -1.9            -1.5            2.85      2.25     5.285714    -0.685714
4     6.6          8        0.1             0.5             0.05      0.25     6.904762    -0.304762
5     7.6          8        1.1             0.5             0.55      0.25     6.904762    0.695238
6     5.6          7        -0.9            -0.5            0.45      0.25     6.095238    -0.495238
7     5.6          6        -0.9            -1.5            1.35      2.25     5.285714    0.314286
8     8.6          9        2.1             1.5             3.15      2.25     7.714286    0.885714
9     8.6          10       2.1             2.5             5.25      6.25     8.523810    0.076190
10    9.6          12       3.1             4.5             13.95     20.25    10.142857   -0.542857
SUM   65           75       0               0               42.5      52.5     65          0
MEAN  6.5          7.5                                                         6.5

Table 3.1 gives the annual consumption of 10 households each selected randomly from a group of
households with a fixed personal disposable income. Both income and consumption are measured
in $10,000, so that the first household earns $50,000 and consumes $46,000 annually. It is
worthwhile doing the computations necessary to obtain the least squares regression estimates
of consumption on income in this simple case and to compare them with those obtained from
a regression package. In order to do this, we first compute \(\bar{Y} = 6.5\) and \(\bar{X} = 7.5\) and form two
new columns of data made up of \(y_i = Y_i - \bar{Y}\) and \(x_i = X_i - \bar{X}\). To get \(\hat{\beta}_{OLS}\) we need \(\sum_{i=1}^n x_i y_i\),
so we multiply these last two columns by each other and sum to get 42.5. The denominator of
\(\hat{\beta}_{OLS}\) is given by \(\sum_{i=1}^n x_i^2\). This is why we square the \(x_i\) column to get \(x_i^2\) and sum to obtain
52.5. Our estimate is \(\hat{\beta}_{OLS} = 42.5/52.5 = 0.8095\), which is the estimated marginal propensity to
consume. This is the extra consumption brought about by an extra dollar of disposable income.

\[ \hat{\alpha}_{OLS} = \bar{Y} - \hat{\beta}_{OLS} \bar{X} = 6.5 - (0.8095)(7.5) = 0.4286 \]

This is the estimated consumption at zero personal disposable income. The fitted values or
predicted values from this regression are computed from \(\hat{Y}_i = \hat{\alpha}_{OLS} + \hat{\beta}_{OLS} X_i = 0.4286 + 0.8095 X_i\)
and are given in Table 3.1. Note that the mean of \(\hat{Y}_i\) is equal to the mean of \(Y_i\),
confirming one of the numerical properties of least squares. The residuals are computed from
\(e_i = Y_i - \hat{Y}_i\) and they satisfy \(\sum_{i=1}^n e_i = 0\). It is left to the reader to verify that \(\sum_{i=1}^n e_i X_i = 0\).
The residual sum of squares is obtained by squaring the column of residuals and summing it.
This gives us \(\sum_{i=1}^n e_i^2 = 2.495238\). This means that \(s^2 = \sum_{i=1}^n e_i^2/(n - 2) = 0.311905\). Its square
root is given by \(s = 0.558\). This is known as the standard error of the regression. In this case,
the estimated \(\mathrm{var}(\hat{\beta}_{OLS})\) is \(s^2 \big/ \sum_{i=1}^n x_i^2 = 0.311905/52.5 = 0.005941\) and the estimated

\[ \mathrm{var}(\hat{\alpha}_{OLS}) = s^2 \Big[\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^n x_i^2}\Big] = 0.311905 \Big[\frac{1}{10} + \frac{(7.5)^2}{52.5}\Big] = 0.365374 \]

Taking the square root of these estimated variances, we get the estimated standard errors of
\(\hat{\alpha}_{OLS}\) and \(\hat{\beta}_{OLS}\) given by \(se(\hat{\alpha}_{OLS}) = 0.60446\) and \(se(\hat{\beta}_{OLS}) = 0.077078\).

Since the disturbances are normal, the OLS estimators are also the maximum likelihood
estimators, and are normally distributed themselves. For the null hypothesis \(H_0^a\): \(\beta = 0\), the
observed t-statistic is

\[ t_{obs} = (\hat{\beta}_{OLS} - 0)/se(\hat{\beta}_{OLS}) = 0.809524/0.077078 = 10.50 \]

and this is highly significant, since \(\Pr[|t_8| > 10.5] < 0.0001\). This probability can be obtained
using most regression packages. It is also known as the p-value or probability value. It shows
that this t-value is highly unlikely and we reject \(H_0^a\) that \(\beta = 0\). Similarly, the null hypothesis
\(H_0^b\): \(\alpha = 0\) gives an observed t-statistic of \(t_{obs} = (\hat{\alpha}_{OLS} - 0)/se(\hat{\alpha}_{OLS}) = 0.428571/0.604462 = 0.709\),
which is not significant, since its p-value is \(\Pr[|t_8| > 0.709] \approx 0.498\). Hence, we do not
reject the null hypothesis \(H_0^b\) that \(\alpha = 0\).

The total sum of squares is \(\sum_{i=1}^n y_i^2 = \sum_{i=1}^n (Y_i - \bar{Y})^2\) which can be obtained by squaring the
\(y_i\) column in Table 3.1 and summing. This yields \(\sum_{i=1}^n y_i^2 = 36.9\). Also, the regression sum of
squares \(= \sum_{i=1}^n \hat{y}_i^2 = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2\) which can be obtained by subtracting \(\bar{\hat{Y}} = \bar{Y} = 6.5\) from
the \(\hat{Y}_i\) column, squaring that column and summing. This yields 34.404762. This could have also
been obtained as

\[ \sum_{i=1}^n \hat{y}_i^2 = \hat{\beta}^2_{OLS} \sum_{i=1}^n x_i^2 = (0.809524)^2 (52.5) = 34.404762. \]

A final check is that \(\sum_{i=1}^n \hat{y}_i^2 = \sum_{i=1}^n y_i^2 - \sum_{i=1}^n e_i^2 = 36.9 - 2.495238 = 34.404762\) as required.

Recall that \(R^2 = r^2_{xy} = (\sum_{i=1}^n x_i y_i)^2 \big/ (\sum_{i=1}^n x_i^2)(\sum_{i=1}^n y_i^2) = (42.5)^2/(52.5)(36.9) = 0.9324\).
This could have also been obtained as \(R^2 = 1 - (\sum_{i=1}^n e_i^2 \big/ \sum_{i=1}^n y_i^2) = 1 - (2.495238/36.9) = 0.9324\),
or as

\[ R^2 = r^2_{y\hat{y}} = \sum_{i=1}^n \hat{y}_i^2 \Big/ \sum_{i=1}^n y_i^2 = 34.404762/36.9 = 0.9324. \]

This means that personal disposable income explains 93.24% of the variation in consumption.
A plot of the actual, predicted and residual values versus time is given in Figure 3.8. This was
done using EViews.


Figure  3.8  Residual  Plot 
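
The arithmetic of this numerical illustration can be retraced from the sums quoted in the text. The following is a minimal sketch in Python, not part of the original illustration; the constant and function names are hypothetical, and the inputs are only the summary numbers reported above (the underlying Table 3.1 data are not used).

```python
# Summary quantities quoted in the text for the numerical illustration.
SUM_XY = 42.5         # sum of x_i * y_i, both in deviations from their means
SUM_X2 = 52.5         # sum of x_i^2
SUM_Y2 = 36.9         # total sum of squares
SUM_E2 = 2.495238     # residual sum of squares
ALPHA_HAT, SE_ALPHA = 0.428571, 0.604462   # intercept estimate and its standard error
SE_BETA = 0.077078                          # standard error of the slope


def illustration_checks():
    """Recompute the statistics of the numerical illustration from its sums."""
    beta_hat = SUM_XY / SUM_X2                 # OLS slope
    reg_ss = beta_hat ** 2 * SUM_X2            # regression sum of squares
    return {
        "beta_hat": beta_hat,
        "regression_ss": reg_ss,
        "regression_ss_check": SUM_Y2 - SUM_E2,            # total minus residual
        "r2_from_rxy": SUM_XY ** 2 / (SUM_X2 * SUM_Y2),    # squared correlation of x and y
        "r2_from_residuals": 1 - SUM_E2 / SUM_Y2,
        "r2_from_fit": reg_ss / SUM_Y2,
        "t_beta": beta_hat / SE_BETA,                      # t-statistic for beta = 0
        "t_alpha": ALPHA_HAT / SE_ALPHA,                   # t-statistic for alpha = 0
    }


if __name__ == "__main__":
    for name, value in illustration_checks().items():
        print(f"{name:>20s} = {value:.6f}")
```

All three routes to $R^2$ agree because they are algebraically the same quantity; small differences in the last digits only reflect the rounding of the reported sums.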


3.10  Empirical  Example 

Table 3.2 gives (i) the logarithm of cigarette consumption (in packs) per person of smoking age
(> 16 years) for 46 states in 1992, (ii) the logarithm of the real price of cigarettes in each state,
and (iii) the logarithm of real disposable income per capita in each state. This is drawn from
the Baltagi and Levin (1992) study on dynamic demand for cigarettes. It can be downloaded as
Cigarett.dat from the Springer web site.

Table 3.2 Cigarette Consumption Data

LNC: log of consumption (in packs) per person of smoking age (>16)
LNP: log of real price (1983$/pack)
LNY: log of real disposable income per-capita (in thousand 1983$)

OBS  STATE  LNC       LNP       LNY
 1   AL     4.96213   0.20487   4.64039
 2   AZ     4.66312   0.16640   4.68389
 3   AR     5.10709   0.23406   4.59435
 4   CA     4.50449   0.36399   4.88147
 5   CT     4.66983   0.32149   5.09472
 6   DE     5.04705   0.21929   4.87087
 7   DC     4.65637   0.28946   5.05960
 8   FL     4.80081   0.28733   4.81155
 9   GA     4.97974   0.12826   4.73299
10   ID     4.74902   0.17541   4.64307
11   IL     4.81445   0.24806   4.90387
12   IN     5.11129   0.08992   4.72916
13   IA     4.80857   0.24081   4.74211
14   KS     4.79263   0.21642   4.79613
15   KY     5.37906  -0.03260   4.64937
16   LA     4.98602   0.23856   4.61461
17   ME     4.98722   0.29106   4.75501
18   MD     4.77751   0.12575   4.94692
19   MA     4.73877   0.22613   4.99998
20   MI     4.94744   0.23067   4.80620
21   MN     4.69589   0.34297   4.81207
22   MS     4.93990   0.13638   4.52938
23   MO     5.06430   0.08731   4.78189
24   MT     4.73313   0.15303   4.70417
25   NE     4.77558   0.18907   4.79671
26   NV     4.96642   0.32304   4.83816
27   NH     5.10990   0.15852   5.00319
28   NJ     4.70633   0.30901   5.10268
29   NM     4.58107   0.16458   4.58202
30   NY     4.66496   0.34701   4.96075
31   ND     4.58237   0.18197   4.69163
32   OH     4.97952   0.12889   4.75875
33   OK     4.72720   0.19554   4.62730
34   PA     4.80363   0.22784   4.83516
35   RI     4.84693   0.30324   4.84670
36   SC     5.07801   0.07944   4.62549
37   SD     4.81545   0.13139   4.67747
38   TN     5.04939   0.15547   4.72525
39   TX     4.65398   0.28196   4.73437
40   UT     4.40859   0.19260   4.55586
41   VT     5.08799   0.18018   4.77578
42   VA     4.9306?   0.11818   4.85490
43   WA     4.66134   0.35053   4.85645
44   WV     4.82454   0.12008   4.56859
45   WI     4.83026   0.22954   4.75826
46   WY     5.00087   0.10029   4.71169

Data: Cigarette Consumption of 46 States in 1992

Table 3.3 Cigarette Consumption Regression

Analysis of Variance

Source   DF   Sum of Squares   Mean Square   F Value   Prob > F
Model     1          0.48048       0.48048    18.084     0.0001
Error    44          1.16905       0.02657

Root MSE   0.16300     R-square   0.2913
Dep Mean   4.84784     Adj R-sq   0.2752
C.V.       3.36234

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   T for H0: Parameter=0   Prob > |T|
INTERCEP    1             5.094108       0.06269897                  81.247       0.0001
LNP         1            -1.198316       0.28178857                  -4.253       0.0001

[Plot: horizontal axis is Log of Real Price (1983$/Pack)]

Figure 3.9 Residuals Versus LNP

Table 3.3 gives the SAS output for the regression of logC on logP. The price elasticity of
demand for cigarettes in this simple model is $(d\log C/d\log P)$, which is the slope coefficient. This
is estimated to be $-1.198$ with a standard error of 0.282. This says that a 10% increase in the real
price of cigarettes is estimated to lead to a 12% drop in per capita consumption of cigarettes. The $R^2$
of this regression is 0.29, and $s^2$ is given by the Mean Square Error of the regression, which is 0.0266.
Figure 3.9 plots the residuals of this regression versus the independent variable, while Figure
3.10 plots the predictions along with the 95% confidence interval band for these predictions.
One observation clearly stands out as an influential observation given its distance from the
rest of the data, and that is the observation for Kentucky, a producer state with a very low real
price. This observation almost anchors the straight line fit through the data. More on influential
observations in Chapter 8.

[Plot: horizontal axis is Log of Real Price (1983$/Pack)]

Figure 3.10 95% Confidence Band for Predicted Values
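
The entries of Table 3.3 are tied together by simple identities, and retracing them is a useful habit when reading regression output. The sketch below is an illustration only: the function names `regression_summary` and `t_ratio` are hypothetical, and the inputs are the sums of squares, estimate and standard error printed in the table.

```python
import math


def regression_summary(model_ss, error_ss, n_obs, n_coef):
    """Recompute s^2, root MSE, R^2, adjusted R^2 and F from the ANOVA sums of squares.

    n_coef counts all estimated coefficients, including the intercept.
    """
    total_ss = model_ss + error_ss
    df_error = n_obs - n_coef
    s2 = error_ss / df_error                       # mean square error
    r2 = model_ss / total_ss
    return {
        "s2": s2,
        "root_mse": math.sqrt(s2),
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n_obs - 1) / df_error,
        "f_value": (model_ss / (n_coef - 1)) / s2,
    }


def t_ratio(estimate, std_error):
    """t-statistic for the null hypothesis that the coefficient is zero."""
    return estimate / std_error


if __name__ == "__main__":
    summary = regression_summary(0.48048, 1.16905, n_obs=46, n_coef=2)
    t_lnp = t_ratio(-1.198316, 0.28178857)
    for name, value in summary.items():
        print(f"{name:>10s} = {value:.5f}")
    print(f"t for LNP  = {t_lnp:.3f}, t squared = {t_lnp ** 2:.3f}")
    # Elasticity reading: a 10% price increase times the slope, in percent.
    print(f"estimated change in consumption = {10 * -1.198316:.1f}%")
```

With a single regressor, the $F$ value equals the square of the $t$-statistic on the slope, which is why 18.084 and $(-4.253)^2$ coincide up to rounding.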


Problems 


1. For the simple regression with a constant $Y_i = \alpha + \beta X_i + u_i$, given in equation (3.1), verify the
following numerical properties of the OLS estimator:

$\sum_{i=1}^n e_i = 0$, $\sum_{i=1}^n e_i X_i = 0$, $\sum_{i=1}^n e_i \hat{Y}_i = 0$, $\sum_{i=1}^n \hat{Y}_i = \sum_{i=1}^n Y_i$

2. For the regression with only a constant $Y_i = \alpha + u_i$ with $u_i \sim$ IID$(0, \sigma^2)$, show that the least
squares estimate of $\alpha$ is $\hat{\alpha}_{OLS} = \bar{Y}$, var$(\hat{\alpha}_{OLS}) = \sigma^2/n$, and the residual sum of squares is
$\sum_{i=1}^n y_i^2 = \sum_{i=1}^n (Y_i - \bar{Y})^2$.

3. For the simple regression without a constant $Y_i = \beta X_i + u_i$, with $u_i \sim$ IID$(0, \sigma^2)$.

(a) Derive the OLS estimator of $\beta$ and find its variance.

(b) What numerical properties of the OLS estimators described in problem 1 still hold for this
model?

(c) Derive the maximum likelihood estimator of $\beta$ and $\sigma^2$ under the assumption $u_i \sim$ IIN$(0, \sigma^2)$.

(d) Assume $\sigma^2$ is known. Derive the Wald, LM and LR tests for $H_0$; $\beta = 1$ versus $H_1$; $\beta \neq 1$.

4. Use the fact that $E(\sum_{i=1}^n x_i u_i)^2 = \sum_{i=1}^n \sum_{j=1}^n x_i x_j E(u_i u_j)$, and assumptions 2 and 3 to prove
equation (3.6).

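5. Using the regression given in equation (3.1):

(a) Show that $\hat{\alpha}_{OLS} = \alpha + (\beta - \hat{\beta}_{OLS})\bar{X} + \bar{u}$, and deduce that $E(\hat{\alpha}_{OLS}) = \alpha$.

(b) Using the fact that $\hat{\beta}_{OLS} - \beta = \sum_{i=1}^n x_i u_i/\sum_{i=1}^n x_i^2$, use the results in part (a) to show that
var$(\hat{\alpha}_{OLS}) = \sigma^2[(1/n) + (\bar{X}^2/\sum_{i=1}^n x_i^2)] = \sigma^2 \sum_{i=1}^n X_i^2/(n \sum_{i=1}^n x_i^2)$.

(c) Show that $\hat{\alpha}_{OLS}$ is consistent for $\alpha$.

(d) Show that cov$(\hat{\alpha}_{OLS}, \hat{\beta}_{OLS}) = -\bar{X}$var$(\hat{\beta}_{OLS}) = -\sigma^2 \bar{X}/\sum_{i=1}^n x_i^2$. This means that the sign
of the covariance is determined by the sign of $\bar{X}$. If $\bar{X}$ is positive, this covariance will be
negative. This also means that if $\hat{\alpha}_{OLS}$ is over-estimated, $\hat{\beta}_{OLS}$ will be under-estimated.

6. Using the regression given in equation (3.1):

(a) Prove that $\hat{\alpha}_{OLS} = \sum_{i=1}^n \lambda_i Y_i$ where $\lambda_i = (1/n) - \bar{X} w_i$ and $w_i = x_i/\sum_{i=1}^n x_i^2$.

(b) Show that $\sum_{i=1}^n \lambda_i = 1$ and $\sum_{i=1}^n \lambda_i X_i = 0$.

(c) Prove that any other linear estimator of $\alpha$, say $\tilde{\alpha} = \sum_{i=1}^n b_i Y_i$, must satisfy $\sum_{i=1}^n b_i = 1$ and
$\sum_{i=1}^n b_i X_i = 0$ for $\tilde{\alpha}$ to be unbiased for $\alpha$.

(d) Let $b_i = \lambda_i + f_i$; show that $\sum_{i=1}^n f_i = 0$ and $\sum_{i=1}^n f_i X_i = 0$.

(e) Prove that var$(\tilde{\alpha}) = \sigma^2 \sum_{i=1}^n b_i^2 = \sigma^2 \sum_{i=1}^n \lambda_i^2 + \sigma^2 \sum_{i=1}^n f_i^2 =$ var$(\hat{\alpha}_{OLS}) + \sigma^2 \sum_{i=1}^n f_i^2$.

7. (a) Differentiate (3.9) with respect to $\alpha$ and $\beta$ and show that $\hat{\alpha}_{MLE} = \hat{\alpha}_{OLS}$, $\hat{\beta}_{MLE} = \hat{\beta}_{OLS}$.
(b) Differentiate (3.9) with respect to $\sigma^2$ and show that $\hat{\sigma}^2_{MLE} = \sum_{i=1}^n e_i^2/n$.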

8. The t-Statistic in a Simple Regression. It is well known that a standard normal random variable
$N(0, 1)$ divided by a square root of a chi-squared random variable divided by its degrees of freedom
$(\chi^2_\nu/\nu)^{1/2}$ results in a random variable that is $t$-distributed with $\nu$ degrees of freedom, provided
the $N(0, 1)$ and the $\chi^2$ variables are independent, see Chapter 2. Use this fact to show that

$(\hat{\beta}_{OLS} - \beta)/[s/(\sum_{i=1}^n x_i^2)^{1/2}] \sim t_{n-2}$.

9. Relationship Between $R^2$ and $r^2_{xy}$:

(a) Using the fact that $R^2 = \sum_{i=1}^n \hat{y}_i^2/\sum_{i=1}^n y_i^2$; $\hat{y}_i = \hat{\beta}_{OLS} x_i$; and $\hat{\beta}_{OLS} = \sum_{i=1}^n x_i y_i/\sum_{i=1}^n x_i^2$,
show that $R^2 = r^2_{xy}$ where

$r^2_{xy} = (\sum_{i=1}^n x_i y_i)^2/[(\sum_{i=1}^n x_i^2)(\sum_{i=1}^n y_i^2)]$.

(b) Using the fact that $y_i = \hat{y}_i + e_i$, show that $\sum_{i=1}^n \hat{y}_i y_i = \sum_{i=1}^n \hat{y}_i^2$, and hence, deduce that
$r^2_{y\hat{y}} = (\sum_{i=1}^n y_i \hat{y}_i)^2/[(\sum_{i=1}^n y_i^2)(\sum_{i=1}^n \hat{y}_i^2)]$ is equal to $R^2$.

10. Prediction. Consider the problem of predicting $Y_0$ from (3.11). Given $X_0$,

(a) Show that $E(Y_0) = \alpha + \beta X_0$.

(b) Show that $\hat{Y}_0$ is unbiased for $E(Y_0)$.

(c) Show that var$(\hat{Y}_0) =$ var$(\hat{\alpha}_{OLS}) + X_0^2$var$(\hat{\beta}_{OLS}) + 2X_0$cov$(\hat{\alpha}_{OLS}, \hat{\beta}_{OLS})$. Deduce that
var$(\hat{Y}_0) = \sigma^2[(1/n) + (X_0 - \bar{X})^2/\sum_{i=1}^n x_i^2]$.

(d) Consider a linear predictor of $E(Y_0)$, say $\tilde{Y}_0 = \sum_{i=1}^n a_i Y_i$; show that $\sum_{i=1}^n a_i = 1$ and
$\sum_{i=1}^n a_i X_i = X_0$ for this predictor to be unbiased for $E(Y_0)$.

(e) Show that var$(\tilde{Y}_0) = \sigma^2 \sum_{i=1}^n a_i^2$. Minimize $\sum_{i=1}^n a_i^2$ subject to the restrictions given
in (d). Prove that the resulting predictor is $\tilde{Y}_0 = \hat{\alpha}_{OLS} + \hat{\beta}_{OLS} X_0$ and that the minimum
variance is $\sigma^2[(1/n) + (X_0 - \bar{X})^2/\sum_{i=1}^n x_i^2]$.


11. Optimal Weighting of Unbiased Estimators. This is based on Baltagi (1995). For the simple
regression without a constant $Y_i = \beta X_i + u_i$, $i = 1, 2, \ldots, N$, where $\beta$ is a scalar and $u_i \sim$ IID$(0, \sigma^2)$
independent of $X_i$. Consider the following three unbiased estimators of $\beta$:

$\hat{\beta}_1 = \sum_{i=1}^n X_i Y_i/\sum_{i=1}^n X_i^2$, $\hat{\beta}_2 = \bar{Y}/\bar{X}$

and

$\hat{\beta}_3 = \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})/\sum_{i=1}^n (X_i - \bar{X})^2$,

where $\bar{X} = \sum_{i=1}^n X_i/n$ and $\bar{Y} = \sum_{i=1}^n Y_i/n$.

(a) Show that cov$(\hat{\beta}_1, \hat{\beta}_2) =$ var$(\hat{\beta}_1) > 0$, and that $\rho_{12}$ = (the correlation coefficient of $\hat{\beta}_1$ and
$\hat{\beta}_2$) = [var$(\hat{\beta}_1)$/var$(\hat{\beta}_2)]^{1/2}$ with $0 < \rho_{12} < 1$. Show that the optimal combination of $\hat{\beta}_1$ and
$\hat{\beta}_2$, given by $\hat{\beta} = \alpha\hat{\beta}_1 + (1 - \alpha)\hat{\beta}_2$ where $-\infty < \alpha < \infty$, occurs at $\alpha^* = 1$. Optimality here
refers to minimizing the variance. Hint: Read the paper by Samuel-Cahn (1994).

(b) Similarly, show that cov$(\hat{\beta}_1, \hat{\beta}_3) =$ var$(\hat{\beta}_1) > 0$, and that $\rho_{13}$ = (the correlation coefficient of
$\hat{\beta}_1$ and $\hat{\beta}_3$) = [var$(\hat{\beta}_1)$/var$(\hat{\beta}_3)]^{1/2} = (1 - \rho_{12}^2)^{1/2}$ with $0 < \rho_{13} < 1$. Conclude that the optimal
combination of $\hat{\beta}_1$ and $\hat{\beta}_3$ is again $\alpha^* = 1$.

(c) Show that cov$(\hat{\beta}_2, \hat{\beta}_3) = 0$ and that the optimal combination of $\hat{\beta}_2$ and $\hat{\beta}_3$ is $\hat{\beta} = (1 - \rho_{12}^2)\hat{\beta}_3 +
\rho_{12}^2\hat{\beta}_2 = \hat{\beta}_1$. This exercise demonstrates a more general result, namely that the BLUE of
$\beta$, in this case $\hat{\beta}_1$, has a positive correlation with any other linear unbiased estimator of $\beta$,
and that this correlation can be easily computed from the ratio of the variances of these two
estimators.

12. Efficiency as Correlation. This is based on Oksanen (1993). Let $\hat{\beta}$ denote the Best Linear Unbiased
Estimator of $\beta$ and let $\tilde{\beta}$ denote any linear unbiased estimator of $\beta$. Show that the relative efficiency
of $\tilde{\beta}$ with respect to $\hat{\beta}$ is the squared correlation coefficient between $\tilde{\beta}$ and $\hat{\beta}$. Hint: Compute the
variance of $\hat{\beta} + \lambda(\tilde{\beta} - \hat{\beta})$ for any $\lambda$. This variance is minimized at $\lambda = 0$ since $\hat{\beta}$ is BLUE. This
should give you the result that $E(\hat{\beta}^2) = E(\hat{\beta}\tilde{\beta})$, which in turn proves the required result, see Zheng
(1994).

13. For the numerical illustration given in section 3.9, what happens to the least squares regression
coefficient estimates $(\hat{\alpha}_{OLS}, \hat{\beta}_{OLS})$, $s^2$, the estimated se$(\hat{\alpha}_{OLS})$ and se$(\hat{\beta}_{OLS})$, the $t$-statistics for $\hat{\alpha}_{OLS}$
and $\hat{\beta}_{OLS}$ for $H_0^a$; $\alpha = 0$, and $H_0^b$; $\beta = 0$, and $R^2$ when:

(a) $Y_i$ is regressed on $X_i + 5$ rather than $X_i$. In other words, we add a constant 5 to each
observation of the explanatory variable $X_i$ and rerun the regression. It is very instructive to
see how the computations in Table 3.1 are affected by this simple transformation on $X_i$.

(b) $Y_i + 2$ is regressed on $X_i$. In other words, a constant 2 is added to $Y_i$.

(c) $Y_i$ is regressed on $2X_i$. (A constant 2 is multiplied by $X_i$).

14. For the cigarette consumption data given in Table 3.2.

(a) Give the descriptive statistics for logC, logP and logY. Plot their histograms. Also, plot logC
versus logY and logC versus logP. Obtain the correlation matrix of these variables.

(b) Run the regression of logC on logY. What is the income elasticity estimate? What is its
standard error? Test the null hypothesis that this elasticity is zero. What is the $s$ and $R^2$ of
this regression?

(c) Show that the square of the simple correlation coefficient between logC and logY is equal to
$R^2$. Show that the square of the correlation coefficient between the fitted and actual values
of logC is also equal to $R^2$.

(d) Plot the residuals versus income. Also, plot the fitted values along with their 95% confidence
band.

15. Consider the simple regression with no constant: $Y_i = \beta X_i + u_i$, $i = 1, 2, \ldots, n$,
where $u_i \sim$ IID$(0, \sigma^2)$ independent of $X_i$. Theil (1971) showed that among all linear estimators in
$Y_i$, the minimum mean square estimator for $\beta$, i.e., that which minimizes $E(\tilde{\beta} - \beta)^2$, is given by

$\tilde{\beta} = \beta^2 \sum_{i=1}^n X_i Y_i/(\beta^2 \sum_{i=1}^n X_i^2 + \sigma^2)$.

(a) Show that $E(\tilde{\beta}) = \beta/(1 + c)$, where $c = \sigma^2/\beta^2 \sum_{i=1}^n X_i^2 > 0$.

(b) Conclude that the Bias$(\tilde{\beta}) = E(\tilde{\beta}) - \beta = -[c/(1 + c)]\beta$. Note that this bias is positive
(negative) when $\beta$ is negative (positive). This also means that $\tilde{\beta}$ is biased towards zero.

(c) Show that MSE$(\tilde{\beta}) = E(\tilde{\beta} - \beta)^2 = \sigma^2/[\sum_{i=1}^n X_i^2 + (\sigma^2/\beta^2)]$. Conclude that it is smaller than
the MSE$(\hat{\beta}_{OLS})$.

Table 3.4 Energy Data for 20 countries

RGDP: real gross domestic product (in 10^6 1975 U.S.$'s)
EN: aggregate energy consumption (in 10^6 Kilograms Coal Equivalents)

Country        RGDP      EN
Malta          1251      456
Iceland        1331      1124
Cyprus         2003      1211
Ireland        11788     11053
Norway         27914     26086
Finland        28388     26405
Portugal       30642     12080
Denmark        34540     27049
Greece         38039     20119
Switzerland    42238     23234
Austria        45451     30633
Sweden         59350     45132
Belgium        62049     58894
Netherlands    82804     84416
Turkey         91946     32619
Spain          159602    88148
Italy          265863    192453
U.K.           279191    268056
France         358675    233907
W. Germany     428888    352.677

16. Table 3.4 gives cross-section data for 1980 on real gross domestic product (RGDP) and aggregate
energy consumption (EN) for 20 countries.

(a) Enter the data and provide descriptive statistics. Plot the histograms for RGDP and EN.
Plot EN versus RGDP.

(b) Estimate the regression:

$\log(En) = \alpha + \beta\log(RGDP) + u$.

Be sure to plot the residuals. What do they show?

(c) Test $H_0$; $\beta = 1$.

(d) One of your Energy data observations has a misplaced decimal. Multiply it by 1000. Now
repeat parts (a), (b) and (c).

(e) Was there any reason for ordering the data from the lowest to highest energy consumption?
Explain.

Lesson Learned: Always plot the residuals. Always check your data very carefully.

17. Using the Energy Data given in Table 3.4, corrected as in problem 16 part (d), is it legitimate to
reverse the form of the equation?

$\log(RGDP) = \gamma + \delta\log(En) + \epsilon$

(a) Economically, does this change the interpretation of the equation? Explain.

(b) Estimate this equation and compare $R^2$ of this equation with that of the previous problem.
Also, check if $\hat{\delta} = 1/\hat{\beta}$. Why are they different?

(c) Statistically, by reversing the equation, which assumptions do we violate?

(d) Show that $\hat{\delta}\hat{\beta} = R^2$.

(e) Effects of changing units in which variables are measured. Suppose you measured energy in
BTU's instead of kilograms of coal equivalents so that the original series was multiplied by
60. How does it change $\alpha$ and $\beta$ in the following equations?

$\log(En) = \alpha + \beta\log(RGDP) + u$ and $En = \alpha^* + \beta^* RGDP + \nu$

Can you explain why $\hat{\alpha}$ changed, but not $\hat{\beta}$ for the log-log model, whereas both $\hat{\alpha}^*$ and
$\hat{\beta}^*$ changed for the linear model?

(f) For the log-log specification and the linear specification, compare the GDP elasticity for
Malta and W. Germany. Are both equally plausible?

(g) Plot the residuals from both linear and log-log models. What do you observe?

(h) Can you compare the $R^2$ and standard errors from both models in part (g)? Hint: Retrieve
$\log(En)$ and $\widehat{\log(En)}$ in the log-log equation, exponentiate, then compute the residuals and
$s$. These are comparable to those obtained from the linear model.

18. For the model considered in problem 16: $\log(En) = \alpha + \beta\log(RGDP) + u$ and measuring energy
in BTU's (like part (e) of problem 17).

(a) What is the 95% confidence prediction interval at the sample mean?

(b) What is the 95% confidence prediction interval for Malta?

(c) What is the 95% confidence prediction interval for West Germany?

Appendix

Centered and Uncentered $R^2$


From the OLS regression on (3.1) we get

$Y_i = \hat{Y}_i + e_i \qquad i = 1, 2, \ldots, n$   (A.1)

where $\hat{Y}_i = \hat{\alpha}_{OLS} + X_i\hat{\beta}_{OLS}$. Squaring and summing the above equation we get

$\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n \hat{Y}_i^2 + \sum_{i=1}^n e_i^2$   (A.2)

since $\sum_{i=1}^n \hat{Y}_i e_i = 0$. The uncentered $R^2$ is given by

uncentered $R^2 = 1 - \sum_{i=1}^n e_i^2/\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n \hat{Y}_i^2/\sum_{i=1}^n Y_i^2$   (A.3)

Note that the total sum of squares for $Y_i$ is not expressed in deviation from the sample mean $\bar{Y}$.
In other words, the uncentered $R^2$ is the proportion of variation of $\sum_{i=1}^n Y_i^2$ that is explained
by the regression on $X$. Regression packages usually report the centered $R^2$ which was defined
in section 3.6 as $1 - (\sum_{i=1}^n e_i^2/\sum_{i=1}^n y_i^2)$ where $y_i = Y_i - \bar{Y}$. The latter measure focuses on
explaining the variation in $Y_i$ after fitting the constant.

From section 3.6, we have seen that a naive model with only a constant in it gives $\bar{Y}$ as the
estimate of the constant, see also problem 2. The variation in $Y_i$ that is not explained by this
naive model is $\sum_{i=1}^n y_i^2 = \sum_{i=1}^n (Y_i - \bar{Y})^2$. Subtracting $n\bar{Y}^2$ from both sides of (A.2) we get

$\sum_{i=1}^n y_i^2 = \sum_{i=1}^n \hat{Y}_i^2 - n\bar{Y}^2 + \sum_{i=1}^n e_i^2$

and the centered $R^2$ is

centered $R^2 = 1 - (\sum_{i=1}^n e_i^2/\sum_{i=1}^n y_i^2) = (\sum_{i=1}^n \hat{Y}_i^2 - n\bar{Y}^2)/\sum_{i=1}^n y_i^2$   (A.4)

If there is a constant in the model, $\bar{\hat{Y}} = \bar{Y}$, see section 3.6, and $\sum_{i=1}^n \hat{y}_i^2 = \sum_{i=1}^n (\hat{Y}_i - \bar{\hat{Y}})^2 =
\sum_{i=1}^n \hat{Y}_i^2 - n\bar{Y}^2$. Therefore, the centered $R^2 = \sum_{i=1}^n \hat{y}_i^2/\sum_{i=1}^n y_i^2$, which is the $R^2$ reported by
regression packages. If there is no constant in the model, some regression packages give you the
option of (no constant) and the $R^2$ reported is usually the uncentered $R^2$. Check your regression
package documentation to verify what you are getting. We will encounter uncentered $R^2$ again
in constructing test statistics using regressions, see for example Chapter 11.
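
The difference between the two measures is easy to see numerically. The following sketch is illustrative only: the function names and the five data points are hypothetical (they are not taken from the text), and the code simply evaluates (A.3) and (A.4) in both of their equivalent forms for a simple regression with a constant.

```python
def fit_simple_ols(x, y):
    """OLS intercept and slope for a simple regression with a constant."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    beta = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - beta * x_bar, beta


def r_squared_measures(x, y):
    """Uncentered (A.3) and centered (A.4) R^2, each computed two ways."""
    n = len(y)
    alpha, beta = fit_simple_ols(x, y)
    fitted = [alpha + beta * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    y_bar = sum(y) / n
    sum_Y2 = sum(yi ** 2 for yi in y)                  # uncentered total sum of squares
    sum_fit2 = sum(fi ** 2 for fi in fitted)
    sum_e2 = sum(ei ** 2 for ei in resid)
    sum_y2 = sum((yi - y_bar) ** 2 for yi in y)        # centered total sum of squares
    return {
        "uncentered": 1 - sum_e2 / sum_Y2,
        "uncentered_alt": sum_fit2 / sum_Y2,
        "centered": 1 - sum_e2 / sum_y2,
        "centered_alt": (sum_fit2 - n * y_bar ** 2) / sum_y2,
    }


if __name__ == "__main__":
    # Hypothetical data, chosen only for illustration.
    x = [1, 2, 3, 4, 5]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    for name, value in r_squared_measures(x, y).items():
        print(f"{name:>15s} = {value:.6f}")
```

Because $\sum Y_i^2 \geq \sum y_i^2$, the uncentered measure can never be smaller than the centered one for the same residuals, which is one reason the two should not be compared across packages without checking which is reported.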


CHAPTER  4 


Multiple  Regression  Analysis 


4.1  Introduction 


So far we have considered only one regressor $X$ besides the constant in the regression equation.
Economic relationships usually include more than one regressor. For example, a demand equation
for a product will usually include real price of that product in addition to real income as
well as real price of a competitive product and the advertising expenditures on this product. In
this case

$Y_i = \alpha + \beta_2 X_{2i} + \beta_3 X_{3i} + .. + \beta_K X_{Ki} + u_i \qquad i = 1, 2, \ldots, n$   (4.1)

where $Y_i$ denotes the $i$-th observation on the dependent variable $Y$, in this case the sales of
this product. $X_{ki}$ denotes the $i$-th observation on the independent variable $X_k$ for $k = 2, \ldots, K$,
in this case, own price, the competitor's price and advertising expenditures. $\alpha$ is the intercept
and $\beta_2, \beta_3, \ldots, \beta_K$ are the $(K - 1)$ slope coefficients. The $u_i$'s satisfy the classical assumptions
1-4 given in Chapter 3. Assumption 4 is modified to include all the $X$'s appearing in the
regression, i.e., every $X_k$ for $k = 2, \ldots, K$ is uncorrelated with the $u_i$'s with the property that
$\sum_{i=1}^n (X_{ki} - \bar{X}_k)^2/n$, where $\bar{X}_k = \sum_{i=1}^n X_{ki}/n$, has a finite probability limit which is different
from zero.

Section  4.2  derives  the  OLS  normal  equations  of  this  multiple  regression  model  and  discovers 
that  an  additional  assumption  is  needed  for  these  equations  to  yield  a  unique  solution. 

4.2  Least  Squares  Estimation 

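As explained in Chapter 3, least squares minimizes the residual sum of squares where the
residuals are now given by $e_i = Y_i - \hat{\alpha} - \sum_{k=2}^K \hat{\beta}_k X_{ki}$ and $\hat{\alpha}$ and $\hat{\beta}_k$ denote guesses on the
regression parameters $\alpha$ and $\beta_k$, respectively. The residual sum of squares

$RSS = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \hat{\alpha} - \hat{\beta}_2 X_{2i} - .. - \hat{\beta}_K X_{Ki})^2$

is minimized by the following $K$ first-order conditions:

$\partial(\sum_{i=1}^n e_i^2)/\partial\hat{\alpha} = -2\sum_{i=1}^n e_i = 0$
$\partial(\sum_{i=1}^n e_i^2)/\partial\hat{\beta}_k = -2\sum_{i=1}^n e_i X_{ki} = 0$, for $k = 2, \ldots, K$   (4.2)

or, equivalently

$\sum_{i=1}^n Y_i = \hat{\alpha} n + \hat{\beta}_2 \sum_{i=1}^n X_{2i} + .. + \hat{\beta}_K \sum_{i=1}^n X_{Ki}$
$\sum_{i=1}^n Y_i X_{2i} = \hat{\alpha} \sum_{i=1}^n X_{2i} + \hat{\beta}_2 \sum_{i=1}^n X_{2i}^2 + .. + \hat{\beta}_K \sum_{i=1}^n X_{2i} X_{Ki}$
$\vdots$
$\sum_{i=1}^n Y_i X_{Ki} = \hat{\alpha} \sum_{i=1}^n X_{Ki} + \hat{\beta}_2 \sum_{i=1}^n X_{2i} X_{Ki} + .. + \hat{\beta}_K \sum_{i=1}^n X_{Ki}^2$   (4.3)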


B.H. Baltagi, Econometrics, Springer Texts in Business and Economics, DOI 10.1007/978-3-642-20059-5 4,

where the first equation multiplies the regression equation by the constant and sums, the second
equation multiplies the regression equation by $X_2$ and sums, and the $K$-th equation multiplies
the regression equation by $X_K$ and sums. $\sum_{i=1}^n u_i = 0$ and $\sum_{i=1}^n u_i X_{ki} = 0$ for $k = 2, \ldots, K$
are implicitly imposed to arrive at (4.3). Solving these $K$ equations in $K$ unknowns, we get the
OLS estimators. This can be done more succinctly in matrix form, see Chapter 7. Assumptions
1-4 ensure that the OLS estimator is BLUE. Assumption 5 introduces normality and as a result
the OLS estimator is also (i) a maximum likelihood estimator, (ii) normally distributed,
and (iii) minimum variance unbiased. Normality also allows tests of hypotheses. Without
the normality assumption, one has to appeal to the Central Limit Theorem and the fact that
the sample is large to perform hypotheses testing.

In  order  to  make  sure  we  can  solve  for  the  OLS  estimators  in  (4.3)  we  need  to  impose  one 
further  assumption  on  the  model  besides  those  considered  in  Chapter  3. 

Assumption 6: No perfect multicollinearity, i.e., the explanatory variables are not perfectly
correlated with each other. This assumption states that no explanatory variable $X_k$ for $k =
2, \ldots, K$ is a perfect linear combination of the other $X$'s. If assumption 6 is violated, then
one of the equations in (4.2) or (4.3) becomes redundant and we would have $K - 1$ linearly
independent equations in $K$ unknowns. This means that we cannot solve uniquely for the OLS
estimators of the $K$ coefficients.

Example 1: If $X_{2i} = 3X_{4i} - 2X_{5i} + X_{7i}$ for $i = 1, \ldots, n$, then multiplying this relationship by
$e_i$ and summing over $i$ we get

$\sum_{i=1}^n X_{2i} e_i = 3\sum_{i=1}^n X_{4i} e_i - 2\sum_{i=1}^n X_{5i} e_i + \sum_{i=1}^n X_{7i} e_i$.

This means that the second OLS normal equation in (4.2) can be represented as a perfect linear
combination of the fourth, fifth and seventh OLS normal equations. Knowing the latter three
equations, the second equation adds no new information. Alternatively, one could substitute
this relationship in the original regression equation (4.1). After some algebra, $X_2$ would be
eliminated and the resulting equation becomes:

$Y_i = \alpha + \beta_3 X_{3i} + (3\beta_2 + \beta_4)X_{4i} + (\beta_5 - 2\beta_2)X_{5i} + \beta_6 X_{6i} + (\beta_2 + \beta_7)X_{7i} + .. + \beta_K X_{Ki} + u_i$.   (4.4)

Note that the coefficients of $X_{4i}$, $X_{5i}$ and $X_{7i}$ are now $(3\beta_2 + \beta_4)$, $(\beta_5 - 2\beta_2)$ and $(\beta_2 + \beta_7)$,
respectively, all of which are contaminated by $\beta_2$. These linear combinations of $\beta_2$, $\beta_4$, $\beta_5$ and
$\beta_7$ can be estimated from regression (4.4), which excludes $X_{2i}$. In fact, the other $X$'s, not
contaminated by this perfect linear relationship, will have coefficients that are not contaminated
by $\beta_2$ and hence are themselves estimable using OLS. However, $\beta_2$, $\beta_4$, $\beta_5$ and $\beta_7$ cannot be
estimated separately. Perfect multicollinearity means that we cannot separate the influence on
$Y$ of the independent variables that are perfectly related. Hence, assumption 6 of no perfect
multicollinearity is needed to guarantee a unique solution of the OLS normal equations. Note
that it applies to perfect linear relationships and does not apply to perfect non-linear relationships
among the independent variables. In other words, one can include $X_{1i}$ and $X_{1i}^2$, like (years
of experience) and (years of experience)$^2$, in an equation explaining earnings of individuals.
Although there is a perfect quadratic relationship between these independent variables, this is
not a perfect linear relationship and therefore does not cause perfect multicollinearity.
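
Assumption 6 can be checked mechanically: the normal equations have a unique solution only if the columns of the regressor matrix, including the constant, are linearly independent. The sketch below is an illustration under stated assumptions: the helper `column_rank` and all data values are hypothetical, and exact rational arithmetic from Python's standard `fractions` module is used so that the rank is not blurred by rounding.

```python
from fractions import Fraction


def column_rank(columns):
    """Exact rank of the matrix whose columns are given, by Gaussian elimination."""
    # Work row by row: each row is one observation across all regressors.
    rows = [[Fraction(col[i]) for col in columns] for i in range(len(columns[0]))]
    rank = 0
    for c in range(len(columns)):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][c] != 0), None)
        if pivot is None:
            continue                      # this column depends on earlier ones
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][c] / rows[rank][c]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


# Hypothetical data mimicking Example 1: X2 is an exact linear combination.
ones = [1, 1, 1, 1, 1, 1]
x4 = [1, 2, 3, 4, 5, 6]
x5 = [2, 1, 4, 3, 6, 5]
x7 = [1, 0, 0, 1, 3, 2]
x2 = [3 * a - 2 * b + c for a, b, c in zip(x4, x5, x7)]

# Hypothetical experience data: a perfect quadratic, but not linear, relationship.
experience = [1, 2, 3, 4, 5, 6]
experience_sq = [e ** 2 for e in experience]

if __name__ == "__main__":
    print("rank with X2 included:", column_rank([ones, x2, x4, x5, x7]), "of 5 columns")
    print("rank with X and X^2:  ", column_rank([ones, experience, experience_sq]), "of 3 columns")
```

In the first design the rank falls one short of the number of coefficients, so one normal equation is redundant; in the second the rank is full even though one regressor is an exact function of the other.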
4.3 Residual Interpretation of Multiple Regression Estimates

Although we did not derive an explicit solution for the OLS estimators of the $\beta$'s, we know
that they are the solutions to (4.2) or (4.3). Let us focus on one of these estimators, say $\hat{\beta}_2$, the
OLS estimator of $\beta_2$, the partial derivative of $Y_i$ with respect to $X_{2i}$. As a solution to (4.2) or
(4.3), $\hat{\beta}_2$ is a multiple regression coefficient estimate of $\beta_2$. Alternatively, we can interpret $\hat{\beta}_2$
as a simple linear regression coefficient.

Claim 1: (i) Run the regression of $X_2$ on all the other $X$'s in (4.1), and obtain the residuals
$\hat{\nu}_2$, i.e., $X_2 = \hat{X}_2 + \hat{\nu}_2$. (ii) Run the simple regression of $Y$ on $\hat{\nu}_2$; the resulting estimate of the
slope coefficient is $\hat{\beta}_2$.

The first regression essentially cleans out the effect of the other $X$'s from $X_2$, leaving the
variation unique to $X_2$ in $\hat{\nu}_2$. Claim 1 states that $\hat{\beta}_2$ can be interpreted as a simple linear regression
coefficient of $Y$ on this residual. This is in line with the partial derivative interpretation of $\beta_2$.
The proof of claim 1 is given in the Appendix. Using the results of the simple regression given
in (3.4) with the regressor $X_i$ replaced by the residual $\hat{\nu}_{2i}$, we get

$\hat{\beta}_2 = \sum_{i=1}^n \hat{\nu}_{2i} Y_i/\sum_{i=1}^n \hat{\nu}_{2i}^2$   (4.5)

and from (3.6) we get

var$(\hat{\beta}_2) = \sigma^2/\sum_{i=1}^n \hat{\nu}_{2i}^2$   (4.6)

An alternative interpretation of $\hat{\beta}_2$ as a simple regression coefficient is the following:

Claim 2: (i) Run $Y$ on all the other $X$'s and get the predicted $\hat{Y}$ and the residuals, say $\hat{\omega}$. (ii)
Run the simple linear regression of $\hat{\omega}$ on $\hat{\nu}_2$. $\hat{\beta}_2$ is the resulting estimate of the slope coefficient.

This regression cleans both $Y$ and $X_2$ from the effect of the other $X$'s and then regresses the
cleaned out residuals of $Y$ on those of $X_2$. Once again this is in line with the partial derivative
interpretation of $\beta_2$. The proof of claim 2 is simple and is given in the Appendix.

These two interpretations of $\hat{\beta}_2$ are important in that they provide an easy way of looking at
a multiple regression in the context of a simple linear regression. Also, they say that there is no
need to clean the effects of one $X$ from the other $X$'s to find its unique effect on $Y$. All one has
to do is to include all these $X$'s in the same multiple regression. Problem 1 verifies this result
with an empirical example. This will also be proved using matrix algebra in Chapter 7.
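
Claims 1 and 2 can be verified on a small data set. The following is a minimal sketch, not the book's proof: the helpers `ols`, `residuals` and `compare_estimates` and the six observations are hypothetical, and exact rational arithmetic is used so that the three estimates can be compared without rounding error.

```python
from fractions import Fraction


def ols(columns, y):
    """Exact OLS coefficients from the normal equations (X'X) b = X'y.

    `columns` is a list of regressor columns; include a column of ones for the constant.
    """
    k = len(columns)
    cols = [[Fraction(v) for v in col] for col in columns]
    target = [Fraction(v) for v in y]
    # Augmented matrix [X'X | X'y].
    a = [[sum(p * q for p, q in zip(cols[r], cols[c])) for c in range(k)]
         + [sum(p * q for p, q in zip(cols[r], target))] for r in range(k)]
    for i in range(k):                               # Gauss-Jordan elimination
        pivot_row = next(r for r in range(i, k) if a[r][i] != 0)
        a[i], a[pivot_row] = a[pivot_row], a[i]
        pivot = a[i][i]
        a[i] = [v / pivot for v in a[i]]
        for r in range(k):
            if r != i:
                factor = a[r][i]
                a[r] = [vr - factor * vi for vr, vi in zip(a[r], a[i])]
    return [row[k] for row in a]


def residuals(columns, y):
    """Residuals from regressing y on the given columns."""
    coefs = ols(columns, y)
    return [Fraction(yi) - sum(b * Fraction(col[i]) for b, col in zip(coefs, columns))
            for i, yi in enumerate(y)]


def compare_estimates(y, x2, x3):
    """Return beta_2 from the multiple regression, from Claim 1 and from Claim 2."""
    ones = [1] * len(y)
    beta2_multiple = ols([ones, x2, x3], y)[1]
    v2 = residuals([ones, x3], x2)                   # X2 cleaned of the other regressors
    sum_v2_sq = sum(v * v for v in v2)
    beta2_claim1 = sum(v * Fraction(yi) for v, yi in zip(v2, y)) / sum_v2_sq   # as in (4.5)
    w = residuals([ones, x3], y)                     # Y cleaned of the other regressors
    beta2_claim2 = sum(v * wi for v, wi in zip(v2, w)) / sum_v2_sq
    return beta2_multiple, beta2_claim1, beta2_claim2


if __name__ == "__main__":
    # Hypothetical data, chosen only for illustration.
    y = [3, 4, 8, 9, 15, 13]
    x2 = [1, 2, 3, 4, 5, 6]
    x3 = [2, 1, 4, 3, 6, 5]
    for label, value in zip(("multiple", "claim 1", "claim 2"), compare_estimates(y, x2, x3)):
        print(f"{label:>9s}: {value} = {float(value):.6f}")
```

The simple-regression slopes are computed without re-centering $\hat{\nu}_2$ because residuals from a regression that includes a constant already sum to zero.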

Recall that $R^2 = 1 - RSS/TSS$ for any regression. Let $R_2^2$ be the $R^2$ for the regression
of $X_2$ on all the other $X$'s, then $R_2^2 = 1 - \sum_{i=1}^n \hat{\nu}_{2i}^2/\sum_{i=1}^n x_{2i}^2$ where $x_{2i} = X_{2i} - \bar{X}_2$ and
$\bar{X}_2 = \sum_{i=1}^n X_{2i}/n$; $TSS = \sum_{i=1}^n (X_{2i} - \bar{X}_2)^2 = \sum_{i=1}^n x_{2i}^2$ and $RSS = \sum_{i=1}^n \hat{\nu}_{2i}^2$. Equivalently,
$\sum_{i=1}^n \hat{\nu}_{2i}^2 = \sum_{i=1}^n x_{2i}^2(1 - R_2^2)$ and the

var$(\hat{\beta}_2) = \sigma^2/\sum_{i=1}^n \hat{\nu}_{2i}^2 = \sigma^2/[\sum_{i=1}^n x_{2i}^2(1 - R_2^2)]$   (4.7)

This means that the larger $R_2^2$, the smaller is $(1 - R_2^2)$ and the larger is var$(\hat{\beta}_2)$, holding $\sigma^2$
and $\sum_{i=1}^n x_{2i}^2$ fixed. This shows the relationship between multicollinearity and the variance of
the OLS estimates. High multicollinearity between $X_2$ and the other $X$'s will result in high
$R_2^2$, which in turn implies high variance for $\hat{\beta}_2$. Perfect multicollinearity is the extreme case
where $R_2^2 = 1$. This in turn implies an infinite variance for $\hat{\beta}_2$. In general, high multicollinearity
among the regressors yields imprecise estimates for these highly correlated variables. The least
squares regression estimates are still unbiased as long as assumptions 1 and 4 are satisfied,
but these estimates are unreliable as reflected by their high variances. However, it is important
to note that a low $\sigma^2$ and a high $\sum_{i=1}^n x_{2i}^2$ could counteract the effect of a high $R_2^2$, leading
to a significant $t$-statistic for $\hat{\beta}_2$. Maddala (2001) argues that high intercorrelation among the
explanatory variables is neither necessary nor sufficient to cause the multicollinearity problem.
In practice, multicollinearity is sensitive to the addition or deletion of observations. More on
this in Chapter 8. Looking at high intercorrelations among the explanatory variables is useful
only as a complaint. It is more important to look at the standard errors and $t$-statistics to assess
the seriousness of multicollinearity.
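
The effect of $R_2^2$ on var$(\hat{\beta}_2)$ in (4.7) can be illustrated with two hypothetical designs that share the same $X_2$ but differ in how closely the other regressor tracks it. This is a sketch under stated assumptions: there is a single other regressor $X_3$, $\sigma^2$ is set to 1, and the function names and data are made up for illustration.

```python
def auxiliary_regression(x, z):
    """Regress x on a constant and z; return (RSS, TSS, R^2) of that regression."""
    n = len(x)
    x_bar, z_bar = sum(x) / n, sum(z) / n
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    szz = sum((zi - z_bar) ** 2 for zi in z)
    tss = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxz / szz
    rss = sum(((xi - x_bar) - slope * (zi - z_bar)) ** 2 for xi, zi in zip(x, z))
    return rss, tss, 1 - rss / tss


def slope_variance(x2, x3, sigma2=1.0):
    """var(beta_2-hat) computed both ways in (4.7), plus the auxiliary R^2."""
    rss, tss, r2 = auxiliary_regression(x2, x3)
    direct = sigma2 / rss                    # sigma^2 / sum of squared residuals of X2
    via_r2 = sigma2 / (tss * (1 - r2))       # sigma^2 / [sum x_2i^2 (1 - R_2^2)]
    return direct, via_r2, r2


# Hypothetical data: the same X2 paired with a weakly and a strongly related X3.
X2 = [1, 2, 3, 4, 5, 6, 7, 8]
X3_WEAK = [3, 1, 4, 1, 5, 9, 2, 6]
X3_STRONG = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0]

if __name__ == "__main__":
    for label, x3 in (("weakly related X3", X3_WEAK), ("strongly related X3", X3_STRONG)):
        direct, via_r2, r2 = slope_variance(X2, x3)
        print(f"{label:>20s}: R2_2 = {r2:.4f}, var(beta2) = {direct:.4f} (check {via_r2:.4f})")
```

Holding $\sigma^2$ and $\sum x_{2i}^2$ fixed, the only thing that changes between the two runs is $R_2^2$, so the ratio of the two variances equals the ratio of the two values of $1/(1 - R_2^2)$.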

Much has been written on possible solutions to the multicollinearity problem, see Hill and
Adkins (2001) for a good summary. Credible candidates include: (i) obtaining new and better
data, but this is rarely available; (ii) introducing nonsample information about the model
parameters based on previous empirical research or economic theory. The problem with the latter
solution is that we never truly know whether the information we introduce is good enough to
reduce estimator Mean Square Error.


4.4  Overspecification  and  Underspecification  of  the  Regression 
Equation 

So far we have assumed that the true linear regression relationship is always correctly specified.
This is likely to be violated in practice. In order to keep things simple, we consider the case
where the true model is a simple regression with one regressor $X_1$.

True model: $Y_i = \alpha + \beta_1 X_{1i} + u_i$

with $u_i \sim$ IID$(0, \sigma^2)$, but the estimated model is overspecified with the inclusion of an additional
irrelevant variable $X_2$, i.e.,

Estimated model: $\hat{Y}_i = \hat{\alpha} + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i}$

From  the  previous  section,  it  is  clear  that  /31  =  EAi  ^liYi/  EI*=i  4  where  ^1  is  the  OLS 
residuals  of  X\  on  X2.  Substituting  the  true  model  for  Y  we  get 

A  =  Pi  E’Li  *  iXu/  E -=i  4  +  EILi  Si  i«i/  E?=i  ^ 

since  Ya=i  "u  =  0.  But,  XH  =  Xu  +  uu  and  Ya=i  x\X\i  =  0  implying  that  Ya=i  ^uxu  = 
Eti Ai-  Hence, 

Pi  =  Pi  +  EEi  Si iUi/ E"=i  "ii  (4-8) 


and  E(/3i)  =  Pi  since  v\  is  a  linear  combination  of  the  X's,  and  E(Xi~u)  =  0  for  k  =  1,2.  Also, 

var(^)  =  a2/ =  a3/ Eli  4(1  ~  4)  (4-9) 

where  xu  =  Xu  —  X\  and  R\  is  the  R2  of  the  regression  of  X\  on  X2.  Using  the  true  model 
to  estimate  P1,  one  would  get  b\  =  EEi  xuyi/  EI=i  xu  whh  E(b\)  =  Pi  and  var(6i)  = 
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a2/  Y^i=i  xi i-  Hence,  var(/31)  >  var(&i).  Note  also  that  in  the  overspecified  model,  the  estimate 
for  /32  which  has  a  true  value  of  zero  is  given  by 

02  =  Emails/ ££=!?*  (4-10) 

where  D2  is  the  OLS  residual  of  X2  on  X\.  Substituting  the  true  model  for  Y  we  get 

=  (4.11) 

since  Ya=i  ^2iXu  =  0  and  Y17=  1  4 i  =  0.  Hence,  E{(32)  =  0  since  U2  is  a  linear  combination 
of  the  X’s  and  E^X^u)  =  0  for  k  =  1,2.  In  summary,  overspecification  still  yields  unbiased 
estimates  of  /31  and  /32,  but  the  price  is  a  higher  variance. 
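
The variance comparison in (4.9) can be illustrated numerically. The following Python sketch assumes NumPy is available; the simulated regressors, the seed and all variable names are illustrative and are not taken from the text. It computes the factor multiplying $\sigma^2$ in the variance of the slope on $X_1$ for the true model and for the overspecified model, and cross-checks (4.9) against the matrix formula.

```python
import numpy as np

rng = np.random.default_rng(0)          # illustrative seed
n = 200
X1 = rng.normal(size=n)
X2 = 0.8 * X1 + rng.normal(size=n)      # irrelevant variable, correlated with X1

x1 = X1 - X1.mean()
x2 = X2 - X2.mean()

# R^2 of the regression of X1 on X2 and a constant
R1_sq = (x1 @ x2) ** 2 / ((x1 @ x1) * (x2 @ x2))

# var(b1)/sigma^2 in the true model, and var(beta1_hat)/sigma^2 from (4.9)
var_true = 1.0 / (x1 @ x1)
var_over = 1.0 / ((x1 @ x1) * (1.0 - R1_sq))

# Cross-check (4.9) against the X1 diagonal element of (Z'Z)^{-1}
Z = np.column_stack([np.ones(n), X1, X2])
var_over_matrix = np.linalg.inv(Z.T @ Z)[1, 1]

print(var_true, var_over, var_over_matrix)
```

The ratio of the two variances is $1/(1 - R_1^2)$, so the penalty for including the irrelevant variable grows with its correlation with $X_1$.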

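Similarly, the true model could be a two-regressors model

True model: $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$

where $u_i \sim \text{IID}(0, \sigma^2)$ but the estimated model is

Estimated model: $\hat{Y}_i = \hat{\alpha} + \hat{\beta}_1 X_{1i}$

The estimated model omits a relevant variable $X_2$ and underspecifies the true relationship. In
this case

$\hat{\beta}_1 = \sum_{i=1}^n x_{1i} Y_i / \sum_{i=1}^n x_{1i}^2$  (4.12)

where $x_{1i} = X_{1i} - \bar{X}_1$. Substituting the true model for $Y$ we get

$\hat{\beta}_1 = \beta_1 + \beta_2 \sum_{i=1}^n x_{1i} X_{2i} / \sum_{i=1}^n x_{1i}^2 + \sum_{i=1}^n x_{1i} u_i / \sum_{i=1}^n x_{1i}^2$  (4.13)

Hence, $E(\hat{\beta}_1) = \beta_1 + \beta_2 b_{12}$ since $E(x_1 u) = 0$ with $b_{12} = \sum_{i=1}^n x_{1i} X_{2i} / \sum_{i=1}^n x_{1i}^2$. Note that $b_{12}$
is the regression slope estimate obtained by regressing $X_2$ on $X_1$ and a constant. Also, the

$\text{var}(\hat{\beta}_1) = E(\hat{\beta}_1 - E(\hat{\beta}_1))^2 = E(\sum_{i=1}^n x_{1i} u_i / \sum_{i=1}^n x_{1i}^2)^2 = \sigma^2 / \sum_{i=1}^n x_{1i}^2$

which understates the variance of the estimate of $\beta_1$ obtained from the true model, i.e.,
$b_1 = \sum_{i=1}^n \hat{\nu}_{1i} Y_i / \sum_{i=1}^n \hat{\nu}_{1i}^2$ with

$\text{var}(b_1) = \sigma^2 / \sum_{i=1}^n \hat{\nu}_{1i}^2 = \sigma^2 / [\sum_{i=1}^n x_{1i}^2 (1 - R_1^2)] \geq \text{var}(\hat{\beta}_1)$.  (4.14)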

In summary, underspecification yields biased estimates of the regression coefficients and
understates the variance of these estimates. This is also an example of imposing a zero restriction
on $\beta_2$ when in fact it is not true. This introduces bias, because the restriction is wrong, but
reduces the variance because it imposes more information even if this information may be false.
We will encounter this general principle again when we discuss distributed lags in Chapter 6.
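
The omitted variable bias in (4.13) can be checked with a small numerical sketch. The Python code below assumes NumPy; the parameter values, seed and simulated regressors are hypothetical. The disturbance is set to zero on purpose, so that the slope from the underspecified regression equals its expected value $\beta_1 + \beta_2 b_{12}$ exactly rather than only on average.

```python
import numpy as np

rng = np.random.default_rng(1)          # illustrative seed
n = 500
alpha, beta1, beta2 = 1.0, 2.0, -1.5    # hypothetical true parameters
X1 = rng.normal(size=n)
X2 = 0.5 * X1 + rng.normal(size=n)      # relevant variable, correlated with X1

x1 = X1 - X1.mean()
b12 = (x1 @ X2) / (x1 @ x1)             # slope from regressing X2 on X1 and a constant

# u = 0, so the short-regression slope equals its expectation exactly
Y = alpha + beta1 * X1 + beta2 * X2
beta1_short = (x1 @ Y) / (x1 @ x1)      # estimator (4.12)

print(beta1_short, beta1 + beta2 * b12)
```

With a nonzero disturbance the two numbers would differ by the last term of (4.13), which has mean zero.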

4.5  R-Squared Versus R-Bar-Squared

Since OLS minimizes the residual sums of squares, adding one or more variables to the
regression cannot increase this residual sums of squares. After all, we are minimizing over a larger
dimension parameter set and the minimum there is smaller or equal to that over a subset of the
parameter space, see problem 4. Therefore, for the same dependent variable $Y$, adding more
variables makes $\sum_{i=1}^n e_i^2$ non-increasing and $R^2$ non-decreasing, since $R^2 = 1 - (\sum_{i=1}^n e_i^2 / \sum_{i=1}^n y_i^2)$.
Hence, a criterion of selecting a regression that “maximizes $R^2$” does not make sense, since we
can add more variables to this regression and improve on this $R^2$ (or at worst leave it the same).
In order to penalize the researcher for adding an extra variable, one computes

$\bar{R}^2 = 1 - [\sum_{i=1}^n e_i^2 / (n - K)] / [\sum_{i=1}^n y_i^2 / (n - 1)]$  (4.15)

where $\sum_{i=1}^n e_i^2$ and $\sum_{i=1}^n y_i^2$ have been adjusted by their degrees of freedom. Note that the
numerator is the $s^2$ of the regression and is equal to $\sum_{i=1}^n e_i^2 / (n - K)$. This differs from the
$s^2$ in Chapter 3 in the degrees of freedom. Here, it is $n - K$, because we have estimated $K$
coefficients, or because (4.2) represents $K$ relationships among the residuals. Therefore knowing
$(n - K)$ residuals we can deduce the other $K$ residuals from (4.2). $\sum_{i=1}^n e_i^2$ is non-increasing as
we add more variables, but the degrees of freedom decrease by one with every added variable.
Therefore, $s^2$ will decrease only if the effect of the $\sum_{i=1}^n e_i^2$ decrease outweighs the effect of the
one degree of freedom loss on $s^2$. This is exactly the idea behind $\bar{R}^2$, i.e., penalizing each added
variable by decreasing the degrees of freedom by one. Hence, this variable will increase $\bar{R}^2$ only
if the reduction in $\sum_{i=1}^n e_i^2$ outweighs this loss, i.e., only if $s^2$ is decreased. Using the definition
of $\bar{R}^2$, one can relate it to $R^2$ as follows:

$(1 - \bar{R}^2) = (1 - R^2)[(n - 1)/(n - K)]$  (4.16)
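
As a quick arithmetic check of (4.16), the following Python sketch recomputes $R^2$ and $\bar{R}^2$ from the sums of squares reported in Table 4.1 later in this chapter (n = 595 observations and K = 13 coefficients, i.e., 12 regressors plus the constant). The function name `adjusted_r2` is an illustrative choice, not something defined in the text.

```python
def adjusted_r2(r2, n, k):
    """R-bar-squared from R-squared via (4.16); k counts the constant."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)

# Error and total sums of squares as reported in Table 4.1
rss, tss = 61.68465, 114.16529
r2 = 1.0 - rss / tss
r2_bar = adjusted_r2(r2, n=595, k=13)

print(r2, r2_bar)
```

Both values agree, up to rounding, with the R-square and Adj R-sq entries of Table 4.1.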

4.6  Testing  Linear  Restrictions 

In the simple linear regression chapter, we proved that the OLS estimates are BLUE provided
assumptions 1 to 4 were satisfied. Then we imposed normality on the disturbances, assumption
5, and proved that the OLS estimators are in fact the maximum likelihood estimators. Then we
derived the Cramér-Rao lower bound, and proved that these estimates are efficient. This will
be done in matrix form in Chapter 7 for the multiple regression case. Under normality one can
test hypotheses about the regression. Basically, any regression package will report the OLS
estimates, their standard errors and the corresponding t-statistics for the null hypothesis that each
individual coefficient is zero. These are tests of significance for each coefficient separately. But
one may be interested in a joint test of significance for two or more coefficients simultaneously,
or simply testing whether linear restrictions on the coefficients of the regression are satisfied.
This will be developed more formally in Chapter 7. For now, all we assume is that the reader can
perform regressions using his or her favorite software like EViews, Stata, SAS, TSP, SHAZAM,
LIMDEP or GAUSS. The solutions to (4.2) or (4.3) result in the OLS estimates. These multiple
regression coefficient estimates can be interpreted as simple regression estimates as shown in
section 4.3. This allows a simple derivation of their standard errors. Now, we would like to use
these regressions to test linear restrictions. The strategy followed is to impose these restrictions
on the model and run the resulting restricted regression. The corresponding Restricted Residual
Sums of Squares is denoted by RRSS. Next, one runs the regression without imposing these
linear restrictions to obtain the Unrestricted Residual Sums of Squares, which we denote by
URSS. Finally, one forms the following F-statistic:


$F = \frac{(RRSS - URSS)/\ell}{URSS/(n - K)}$  (4.17)


where $\ell$ denotes the number of restrictions, and $n - K$ gives the degrees of freedom of the
unrestricted model. The idea behind this test is intuitive. If the restrictions are true, then the
RRSS should not be much different from the URSS. If RRSS is much larger than URSS, then
we reject these restrictions. The denominator of the F-statistic is a consistent estimate of the
unrestricted regression variance. Dividing by the latter makes the F-statistic invariant to units
of measurement. Let us consider two examples:
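
The computation in (4.17) is simple enough to write as a small helper. The Python sketch below is only an illustration; the function name `f_statistic` and the numbers passed to it are hypothetical and do not come from the text.

```python
def f_statistic(rrss, urss, num_restrictions, n, k):
    """F-statistic in (4.17); k is the number of coefficients in the unrestricted model."""
    return ((rrss - urss) / num_restrictions) / (urss / (n - k))

# Hypothetical numbers, for illustration only
F = f_statistic(rrss=120.0, urss=100.0, num_restrictions=2, n=54, k=4)
print(F)
```

If SciPy is available, `scipy.stats.f.sf(F, 2, 50)` would give the corresponding p-value for this hypothetical case.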

Example 2: Testing the joint significance of two regression coefficients. For example, let us test the
following null hypothesis $H_0$; $\beta_2 = \beta_3 = 0$. These are two restrictions $\beta_2 = 0$ and $\beta_3 = 0$
and they are to be tested jointly. We know how to test for $\beta_2 = 0$ alone or $\beta_3 = 0$ alone
with individual t-tests. This is a test of joint significance of the two coefficients. Imposing this
restriction means the removal of $X_2$ and $X_3$ from the regression, i.e., running the regression
of $Y$ on $X_4, \ldots, X_K$ excluding $X_2$ and $X_3$. Hence, the number of parameters to be estimated
becomes $(K - 2)$ and the degrees of freedom of this restricted regression are $n - (K - 2)$. The
unrestricted regression is the one including all the $X$'s in the model. Its degrees of freedom
are $(n - K)$. The number of restrictions is 2 and this can also be inferred from the difference
between the degrees of freedom of the restricted and unrestricted regressions. All the ingredients
are now available for computing $F$ in (4.17) and this will be distributed as $F_{2, n-K}$.

Example 3: Test the equality of two regression coefficients $H_0$; $\beta_3 = \beta_4$ against the alternative
that $H_1$; $\beta_3 \neq \beta_4$. Note that $H_0$ can be rewritten as $H_0$; $\beta_3 - \beta_4 = 0$. This can be tested using a
t-statistic that tests whether $d = \beta_3 - \beta_4$ is equal to zero. From the unrestricted regression, we
can obtain $\hat{d} = \hat{\beta}_3 - \hat{\beta}_4$ with $\text{var}(\hat{d}) = \text{var}(\hat{\beta}_3) + \text{var}(\hat{\beta}_4) - 2\text{cov}(\hat{\beta}_3, \hat{\beta}_4)$. The variance-covariance
matrix of the regression coefficients can be printed out with any regression package. In section
4.3, we gave these variances and covariances a simple regression interpretation. This means
that $\text{se}(\hat{d}) = \sqrt{\text{var}(\hat{d})}$ and the t-statistic is simply $t = (\hat{d} - 0)/\text{se}(\hat{d})$ which is distributed as
$t_{n-K}$ under $H_0$. Alternatively, one can run an F-test with the RRSS obtained from running the
following regression

$Y_i = \alpha + \beta_2 X_{2i} + \beta_3 (X_{3i} + X_{4i}) + \beta_5 X_{5i} + \ldots + \beta_K X_{Ki} + u_i$

with $\beta_3 = \beta_4$ substituted in for $\beta_4$. This regression has the variable $(X_{3i} + X_{4i})$ rather than $X_{3i}$
and $X_{4i}$ separately. The URSS is obtained from the regression of $Y$ on all the $X$'s in the model. The
degrees of freedom of the resulting F-statistic are 1 and $n - K$. The numerator degree of freedom states
that there is only one restriction. It will be proved in Chapter 7 that the square of the t-statistic
is exactly equal to the F-statistic just derived. Both methods of testing are equivalent. The first
one computes only the unrestricted regression and involves some further variance computations,
while the latter involves running two regressions and computing the usual F-statistic.
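
The equivalence of the two approaches in Example 3 can be verified numerically before it is proved in Chapter 7. The Python sketch below assumes NumPy; the simulated data, the seed and the helper `ols` are hypothetical and serve only as an illustration with K = 4 (a constant, $X_2$, $X_3$ and $X_4$).

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and the residual sum of squares."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return b, e @ e

rng = np.random.default_rng(2)                  # illustrative seed
n = 100
X2, X3, X4 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * X2 + 0.7 * X3 + 0.7 * X4 + rng.normal(size=n)

# Unrestricted regression: constant, X2, X3, X4
Xu = np.column_stack([np.ones(n), X2, X3, X4])
b, urss = ols(Xu, y)
K = Xu.shape[1]
s2 = urss / (n - K)
V = s2 * np.linalg.inv(Xu.T @ Xu)               # estimated variance-covariance matrix

# t-statistic for d = beta3 - beta4
d_hat = b[2] - b[3]
se_d = np.sqrt(V[2, 2] + V[3, 3] - 2.0 * V[2, 3])
t_stat = d_hat / se_d

# Restricted regression uses (X3 + X4) as a single regressor
Xr = np.column_stack([np.ones(n), X2, X3 + X4])
_, rrss = ols(Xr, y)
F_stat = (rrss - urss) / (urss / (n - K))       # one restriction

print(t_stat ** 2, F_stat)
```

The two printed numbers coincide up to floating point error, whatever data are simulated.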

Example 4: Test the joint hypothesis $H_0$; $\beta_3 = 1$ and $\beta_2 - 2\beta_4 = 0$. These two restrictions are
usually obtained from prior information or imposed by theory. The first restriction is $\beta_3 = 1$.
The value 1 could have been any other constant. The second restriction shows that a linear
combination of $\beta_2$ and $\beta_4$ is equal to zero. Substituting these restrictions in (4.1) we get

$Y_i = \alpha + \beta_2 X_{2i} + X_{3i} + \frac{1}{2}\beta_2 X_{4i} + \beta_5 X_{5i} + \ldots + \beta_K X_{Ki} + u_i$

which can be written as

$Y_i - X_{3i} = \alpha + \beta_2 (X_{2i} + \frac{1}{2} X_{4i}) + \beta_5 X_{5i} + \ldots + \beta_K X_{Ki} + u_i$

Therefore, the RRSS can be obtained by regressing $(Y - X_3)$ on $(X_2 + \frac{1}{2} X_4), X_5, \ldots, X_K$. This
regression has $n - (K - 2)$ degrees of freedom. The URSS is obtained from the regression with all the
$X$'s included. The resulting F-statistic has 2 and $n - K$ degrees of freedom.

Example 5: Testing constant returns to scale in a Cobb-Douglas production function.
$Q = A K^{\alpha} L^{\beta} E^{\gamma} M^{\delta} e^{u}$ is a Cobb-Douglas production function with capital ($K$), labor ($L$), energy ($E$)
and material ($M$). Constant returns to scale means that a proportional increase in the inputs
produces the same proportional increase in output. Let this proportional increase be $\lambda$, then $K^* = \lambda K$,
$L^* = \lambda L$, $E^* = \lambda E$ and $M^* = \lambda M$. $Q^* = \lambda^{(\alpha+\beta+\gamma+\delta)} A K^{\alpha} L^{\beta} E^{\gamma} M^{\delta} e^{u} = \lambda^{(\alpha+\beta+\gamma+\delta)} Q$.
For this last term to be equal to $\lambda Q$, the following restriction must hold: $\alpha + \beta + \gamma + \delta = 1$.
Hence, a test of constant returns to scale is equivalent to testing $H_0$; $\alpha + \beta + \gamma + \delta = 1$. The
Cobb-Douglas production function is nonlinear in the variables, and can be linearized by taking
logs of both sides, i.e.,

$\log Q = \log A + \alpha \log K + \beta \log L + \gamma \log E + \delta \log M + u$  (4.18)

This is a linear regression with $Y = \log Q$, $X_2 = \log K$, $X_3 = \log L$, $X_4 = \log E$ and $X_5 = \log M$.
Ordinary least squares is BLUE on this model, which is nonlinear in the original variables but linear
in logs, as long as $u$ satisfies assumptions 1-4. Note that these disturbances entered the original
Cobb-Douglas production function multiplicatively as $\exp(u_i)$. Had these disturbances entered
additively as $Q = A K^{\alpha} L^{\beta} E^{\gamma} M^{\delta} + u$, then taking logs does not simplify the right hand side and one
has to estimate this with nonlinear least squares, see Chapter 8. Now we can test constant returns
to scale as follows. The unrestricted regression is given by (4.18) and its degrees of freedom are
$n - 5$. Imposing $H_0$ means substituting the linear restriction by replacing say $\beta$ by $(1 - \alpha - \gamma - \delta)$.
This results after collecting terms in the following restricted regression with one less parameter

$\log(Q/L) = \log A + \alpha \log(K/L) + \gamma \log(E/L) + \delta \log(M/L) + u$  (4.19)

The degrees of freedom are $n - 4$. Once again all the ingredients for the test in (4.17) are there
and this statistic is distributed as $F_{1, n-5}$ under the null hypothesis.

Example 6: Joint significance of all the slope coefficients. The null hypothesis is

$H_0$; $\beta_2 = \beta_3 = \ldots = \beta_K = 0$

against the alternative $H_1$; at least one $\beta_k \neq 0$ for $k = 2, \ldots, K$. Under the null, only the
constant is left in the regression. Problem 3.2 showed that for a regression of $Y$ on a constant
only, the least squares estimate of $\alpha$ is $\bar{Y}$. This means that the corresponding residual sum of
squares is $\sum_{i=1}^n (Y_i - \bar{Y})^2$. Therefore, RRSS = Total sums of squares of regression (4.1) = $\sum_{i=1}^n y_i^2$.
The URSS is the usual residual sums of squares $\sum_{i=1}^n e_i^2$ from the unrestricted regression given
by (4.1). Hence, the corresponding F-statistic for $H_0$ is

$F = \frac{(TSS - RSS)/(K - 1)}{RSS/(n - K)} = \frac{(\sum_{i=1}^n y_i^2 - \sum_{i=1}^n e_i^2)/(K - 1)}{\sum_{i=1}^n e_i^2/(n - K)} = \frac{R^2}{1 - R^2} \cdot \frac{n - K}{K - 1}$  (4.20)

where $R^2 = 1 - (\sum_{i=1}^n e_i^2 / \sum_{i=1}^n y_i^2)$. This F-statistic has $(K - 1)$ and $(n - K)$ degrees of freedom
under $H_0$, and is usually reported by regression packages.
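
The two forms of (4.20) can be checked against the analysis of variance output in Table 4.1 later in this chapter. The Python sketch below uses only the total and error sums of squares reported there, with n = 595 and K = 13; the variable names are illustrative.

```python
# Values reported in Table 4.1
n, K = 595, 13
tss, rss = 114.16529, 61.68465

# F computed from the sums of squares
F_from_ss = ((tss - rss) / (K - 1)) / (rss / (n - K))

# F computed from R-squared
r2 = 1.0 - rss / tss
F_from_r2 = (r2 / (1.0 - r2)) * (n - K) / (K - 1)

print(F_from_ss, F_from_r2)
```

Both expressions give the same number, which matches the F Value reported in the table up to rounding.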


4.7  Dummy  Variables 

Many explanatory variables are qualitative in nature. For example, the head of a household
could be male or female, white or non-white, employed or unemployed. In this case, one codes
these variables as “M” for male and “F” for female, or changes this qualitative variable into a
quantitative variable called FEMALE which takes the value “0” for male and “1” for female.
This obviously raises the question: “why not have a variable MALE that takes on the value 1 for
male and 0 for female?” Actually, the variable MALE would be exactly 1 - FEMALE. In other
words, the zero and one can be thought of as a switch, which turns on when it is 1 and off when
it is 0. Suppose that we are interested in the earnings of households, denoted by EARN, and
MALE and FEMALE are the only explanatory variables available, then problem 10 asks the
reader to verify that running OLS on the following model:

$\text{EARN} = \alpha_M \text{MALE} + \alpha_F \text{FEMALE} + u$  (4.21)

gives $\hat{\alpha}_M$ = “average earnings of the males in the sample” and $\hat{\alpha}_F$ = “average earnings of
the females in the sample.” Notice that there is no intercept in (4.21). This is because of what
is known in the literature as the “dummy variable trap.” Briefly stated, there will be perfect
multicollinearity between MALE, FEMALE and the constant. In fact, MALE + FEMALE =
1. Some researchers may choose to include the intercept and exclude one of the sex dummy
variables, say MALE, then

$\text{EARN} = \alpha + \beta \text{FEMALE} + u$  (4.22)

and the OLS estimates give $\hat{\alpha}$ = “average earnings of males in the sample” = $\hat{\alpha}_M$, while $\hat{\beta}$ =
$\hat{\alpha}_F - \hat{\alpha}_M$ = “the difference in average earnings between females and males in the sample.”
Regression (4.22) is more popular when one is interested in contrasting the earnings between
males and females and obtaining with one regression the markup or markdown in average
earnings $(\hat{\alpha}_F - \hat{\alpha}_M)$ as well as the test of whether this difference is statistically different from zero.
This would be simply the t-statistic on $\hat{\beta}$ in (4.22). On the other hand, if one is interested in
estimating the average earnings of males and females separately, then model (4.21) should be
the one to consider. In this case, the t-test for $\alpha_F - \alpha_M = 0$ would involve further calculations
not directly given from the regression in (4.21) but similar to the calculations given in
Example 3.
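
The interpretation of (4.21) and (4.22) as sample means is easy to confirm on a toy data set. The Python sketch below assumes NumPy; the six earnings figures are made up for illustration and have no connection to the data used in this chapter.

```python
import numpy as np

earn = np.array([30.0, 42.0, 36.0, 28.0, 35.0, 33.0])   # hypothetical earnings
female = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
male = 1.0 - female

# (4.21): no intercept, both dummies included
X_421 = np.column_stack([male, female])
a_m, a_f = np.linalg.lstsq(X_421, earn, rcond=None)[0]

# (4.22): intercept plus FEMALE only
X_422 = np.column_stack([np.ones_like(earn), female])
a, b = np.linalg.lstsq(X_422, earn, rcond=None)[0]

print(a_m, a_f)   # sample means of males and females
print(a, b)       # male mean, and female-minus-male difference
```

Adding a constant to `X_421` would make its columns linearly dependent, which is the dummy variable trap described above.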

What happens when another qualitative variable is included, to depict another classification
of the individuals in the sample, say for example, race? If there are three race groups in the
sample, WHITE, BLACK and HISPANIC, one could create a dummy variable for each of
these classifications. For example, WHITE will take the value 1 when the individual is white
and 0 when the individual is non-white. Note that the dummy variable trap does not allow
the inclusion of all three categories as they sum up to 1. Also, even if the intercept is dropped,
once MALE and FEMALE are included, perfect multicollinearity is still present because MALE
+ FEMALE = WHITE + BLACK + HISPANIC. Therefore, one category from race should
be dropped. Suits (1984) argues that the researcher should use the dummy variable category
omission to his or her advantage, in interpreting the results, keeping in mind the purpose of
the study. For example, if one is interested in comparing earnings across the sexes holding race
constant, the omission of MALE or FEMALE is natural, whereas, if one is interested in the race
differential in earnings holding gender constant, one of the race variables should be omitted.
Whichever variable is omitted, this becomes the base category to which the other earnings are
compared. Most researchers prefer to keep an intercept, although regression packages allow for
a no intercept option. In this case one should omit one category from each of the race and sex
classifications. For example, if MALE and WHITE are omitted:

$\text{EARN} = \alpha + \beta_F \text{FEMALE} + \beta_B \text{BLACK} + \beta_H \text{HISPANIC} + u$  (4.23)

Assuming the error $u$ satisfies all the classical assumptions, and taking expected values of both
sides of (4.23), one can see that the intercept $\alpha$ = the expected value of earnings of the omitted
category which is “white males”. For this category, all the other switches are off. Similarly,
$\alpha + \beta_F$ is the expected value of earnings of “white females,” since the FEMALE switch is
on. One can conclude that $\beta_F$ = difference in the expected value of earnings between white
females and white males. Similarly, one can show that $\alpha + \beta_B$ is the expected earnings of “black
males” and $\alpha + \beta_F + \beta_B$ is the expected earnings of “black females.” Therefore, $\beta_F$ represents
the difference in expected earnings between black females and black males. In fact, problem
11 asks the reader to show that $\beta_F$ represents the difference in expected earnings between
hispanic females and hispanic males. In other words, $\beta_F$ represents the differential in expected
earnings between females and males holding race constant. Similarly, one can show that $\beta_B$ is
the difference in expected earnings between blacks and whites holding sex constant, and $\beta_H$ is
the differential in expected earnings between hispanics and whites holding sex constant. The
main key to the interpretation of the dummy variable coefficients is to be able to turn on and
turn off the proper switches, and write the correct expectations.

The real regression will contain other quantitative and qualitative variables, like

$\text{EARN} = \alpha + \beta_F \text{FEMALE} + \beta_B \text{BLACK} + \beta_H \text{HISPANIC} + \gamma_1 \text{EXP} + \gamma_2 \text{EXP}^2 + \gamma_3 \text{EDUC} + \gamma_4 \text{UNION} + u$  (4.24)

where EXP is years of job experience, EDUC is years of education, and UNION is 1 if the
individual belongs to a union and 0 otherwise. $\text{EXP}^2$ is the squared value of EXP. Once again,
one can interpret the coefficients of these regressions by turning on or off the proper switches. For
example, $\gamma_4$ is interpreted as the expected difference in earnings between union and non-union
members holding all other variables included in (4.24) constant. Halvorsen and Palmquist (1980)
warn economists about the interpretation of dummy variable coefficients when the dependent
variable is in logs. For example, if the earnings equation is semi-logarithmic:

$\log(\text{Earnings}) = \alpha + \beta \text{UNION} + \gamma \text{EDUC} + u$

then $\gamma$ = % change in earnings for one extra year of education, holding union membership
constant. But, what about the returns for union membership? If we let $Y_1$ = log(Earnings)
when the individual belongs to a union, and $Y_0$ = log(Earnings) when the individual does not
belong to a union, then $g$ = % change in earnings due to union membership = $(e^{Y_1} - e^{Y_0})/e^{Y_0}$.
Equivalently, one can write that $\log(1 + g) = Y_1 - Y_0 = \beta$, or that $g = e^{\beta} - 1$. In other words,
one should not hasten to conclude that $\beta$ has the same interpretation as $\gamma$. In fact, the %
change in earnings due to union membership is $e^{\beta} - 1$ and not $\beta$. The error involved in using $\hat{\beta}$
rather than $e^{\hat{\beta}} - 1$ to estimate $g$ could be substantial, especially if $\hat{\beta}$ is large. For example, when
$\hat{\beta} = 0.5, 0.75, 1$; $\hat{g} = e^{\hat{\beta}} - 1 = 0.65, 1.12, 1.72$, respectively. Kennedy (1981) notes that if $\hat{\beta}$ is
unbiased for $\beta$, $\hat{g}$ is not necessarily unbiased for $g$. However, consistency of $\hat{\beta}$ implies consistency
for $\hat{g}$. If one assumes log-normal distributed errors, then $E(e^{\hat{\beta}}) = e^{\beta + 0.5\text{Var}(\hat{\beta})}$, so that $e^{\hat{\beta}}$
overstates $e^{\beta}$ on average. Based on this result, Kennedy (1981) suggests estimating $g$ by
$\tilde{g} = e^{\hat{\beta} - 0.5\widehat{\text{Var}}(\hat{\beta})} - 1$, where $\widehat{\text{Var}}(\hat{\beta})$ is a consistent estimate of $\text{Var}(\hat{\beta})$.
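
The size of the discrepancy between $\hat{\beta}$ and $e^{\hat{\beta}} - 1$ is easy to tabulate. The Python sketch below reproduces the three values quoted above and then applies both calculations to the UNION coefficient and standard error reported in Table 4.1 later in this chapter. The function names are illustrative. The second function implements the Kennedy (1981) correction in its usual form, which subtracts half the estimated variance in the exponent because $e^{\hat{\beta}}$ overstates $e^{\beta}$ on average.

```python
import math

def pct_effect(beta):
    """Proportional effect g = exp(beta) - 1 of a dummy in a semi-log equation."""
    return math.exp(beta) - 1.0

def kennedy_effect(beta_hat, var_beta_hat):
    """Kennedy (1981) estimate: exp(beta_hat - 0.5 * var_beta_hat) - 1."""
    return math.exp(beta_hat - 0.5 * var_beta_hat) - 1.0

for beta in (0.5, 0.75, 1.0):
    print(beta, round(pct_effect(beta), 2))

# UNION coefficient and standard error as reported in Table 4.1
b_union, se_union = 0.106278, 0.03167547
g_naive = pct_effect(b_union)
g_kennedy = kennedy_effect(b_union, se_union ** 2)
print(b_union, g_naive, g_kennedy)
```

For a coefficient as small as the one on UNION, the three numbers are close; the gap widens as the coefficient or its variance grows.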

Another use of dummy variables is in taking into account seasonal factors, i.e., including
3 seasonal dummy variables with the omitted season becoming the base for comparison.¹ For
example:

$\text{Sales} = \alpha + \beta_W \text{Winter} + \beta_S \text{Spring} + \beta_F \text{Fall} + \gamma_1 \text{Price} + u$  (4.25)

the omitted season being the Summer season, and if (4.25) models the sales of air-conditioning
units, then $\beta_F$ is the difference in expected sales between the Fall and Summer seasons, holding
the price of an air-conditioning unit constant. If these were heating units one may want to
change the base season for comparison.

Another use of dummy variables is for War years, where consumption is not at its normal
level say due to rationing. Consider estimating the following consumption function

$C_t = \alpha + \beta Y_t + \delta \text{WAR}_t + u_t \qquad t = 1, 2, \ldots, T$  (4.26)

where $C_t$ denotes real per capita consumption, $Y_t$ denotes real per capita personal disposable
income, and $\text{WAR}_t$ is a dummy variable taking the value 1 if it is a War time period and 0
otherwise. Note that the War years do not affect the slope of the consumption line with respect
to income, only the intercept. The intercept is $\alpha$ in non-War years and $\alpha + \delta$ in War years. In
other words, the marginal propensity to consume out of income is the same in War and non-War
years, only the level of consumption is different.

Of course, one can dummy other unusual years like periods of strike, years of natural disaster,
earthquakes, floods, hurricanes, or external shocks beyond control, like the oil embargo of 1973.
If this dummy includes only one year like 1973, then the dummy variable for 1973, call it
D73, takes the value 1 for 1973 and zero otherwise. Including D73 as an extra variable in the
regression has the effect of removing the 1973 observation from estimation purposes, and the
resulting regression coefficient estimates are exactly the same as those obtained excluding
the 1973 observation and its corresponding dummy variable. In fact, using matrix algebra in
Chapter 7, we will show that the coefficient estimate of D73 is the forecast error for 1973,
using the regression that ignores the 1973 observation. In addition, the standard error of the
dummy coefficient estimate is the standard error of this forecast. This is a much easier way
of obtaining the forecast error and its standard error from the regression package without
additional computations, see Salkever (1976). More on this in Chapter 7.
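
The claim about a one-period dummy can be verified numerically ahead of the matrix proof in Chapter 7. The Python sketch below assumes NumPy; the simulated consumption and income series, the seed, and the position of the "unusual" period are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)                 # illustrative seed
T = 30
Y = rng.normal(10.0, 2.0, size=T)              # hypothetical income series
C = 1.0 + 0.8 * Y + rng.normal(size=T)         # hypothetical consumption series
D = np.zeros(T)
D[12] = 1.0                                    # dummy for a single unusual period

# Regression including the one-period dummy
X_full = np.column_stack([np.ones(T), Y, D])
b_full = np.linalg.lstsq(X_full, C, rcond=None)[0]

# Regression that drops that observation (and the dummy)
keep = D == 0
X_drop = np.column_stack([np.ones(keep.sum()), Y[keep]])
b_drop = np.linalg.lstsq(X_drop, C[keep], rcond=None)[0]

# Forecast error for the omitted period from the second regression
forecast_error = C[12] - (b_drop[0] + b_drop[1] * Y[12])

print(b_full[:2], b_drop)
print(b_full[2], forecast_error)
```

The intercept and slope agree across the two regressions, and the dummy coefficient equals the forecast error for the excluded period.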

Interaction Effects

So  far  the  dummy  variables  have  been  used  to  shift  the  intercept  of  the  regression  keeping 
the  slopes  constant.  One  can  also  use  the  dummy  variables  to  shift  the  slopes  by  letting  them 
interact  with  the  explanatory  variables.  For  example,  consider  the  following  earnings  equation: 

$\text{EARN} = \alpha + \alpha_F \text{FEMALE} + \beta \text{EDUC} + u$  (4.27)

In this regression, only the intercept shifts from males to females. The returns to an extra year
of education is simply $\beta$, which is assumed to be the same for males as well as females. But if
we now introduce the interaction variable (FEMALE × EDUC), then the regression becomes:

$\text{EARN} = \alpha + \alpha_F \text{FEMALE} + \beta \text{EDUC} + \gamma (\text{FEMALE} \times \text{EDUC}) + u$  (4.28)

In this case, the returns to an extra year of education depends upon the sex of the individual.
In fact, $\partial(\text{EARN})/\partial(\text{EDUC}) = \beta + \gamma(\text{FEMALE})$ = $\beta$ if male, and $\beta + \gamma$ if female. Note that
the interaction variable = EDUC if the individual is female and 0 if the individual is male.

Estimating (4.28) is equivalent to estimating two earnings equations, one for males and
another one for females, separately. The only difference is that (4.28) imposes the same variance
across the two groups, whereas separate regressions do not impose this, albeit restrictive,
equality of the variances assumption. This set-up is ideal for testing the equality of slopes, equality
of intercepts, or equality of both intercepts and slopes across the sexes. This can be done with
the F-test described in (4.17). In fact, for $H_0$; equality of slopes, given different intercepts,
the restricted residuals sum of squares (RRSS) is obtained from (4.27), while the unrestricted
residuals sum of squares (URSS) is obtained from (4.28). Problem 12 asks the reader to set
up the F-test for the following null hypotheses: (i) equality of slopes and intercepts, and (ii)
equality of intercepts given the same slopes.
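
The equivalence between (4.28) and two separate regressions can be seen on simulated data. The Python sketch below assumes NumPy; the data generating values, the seed and the helper `simple_ols` are hypothetical. It checks that the male intercept and slope equal $(\hat{\alpha}, \hat{\beta})$ and that the female intercept and slope equal $(\hat{\alpha} + \hat{\alpha}_F, \hat{\beta} + \hat{\gamma})$.

```python
import numpy as np

rng = np.random.default_rng(4)                       # illustrative seed
n = 200
female = (rng.random(n) < 0.5).astype(float)
educ = rng.integers(8, 21, size=n).astype(float)     # hypothetical years of education
earn = 5.0 - 1.0 * female + 0.9 * educ - 0.2 * female * educ + rng.normal(size=n)

# Pooled regression with the interaction term, as in (4.28)
X = np.column_stack([np.ones(n), female, educ, female * educ])
a, a_f, b, g = np.linalg.lstsq(X, earn, rcond=None)[0]

def simple_ols(x, y):
    """Intercept and slope from regressing y on a constant and x."""
    Z = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

m, f = female == 0, female == 1
a_m, b_m = simple_ols(educ[m], earn[m])              # males only
a_w, b_w = simple_ols(educ[f], earn[f])              # females only

print((a, b), (a_m, b_m))
print((a + a_f, b + g), (a_w, b_w))
```

Only the coefficient estimates coincide. The standard errors generally differ, because the pooled regression estimates a single error variance for both groups.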

Dummy  variables  have  many  useful  applications  in  economics.  For  example,  several  tests 
including  the  Chow  (1960)  test,  and  Utts  (1982)  Rainbow  test  described  in  Chapter  8,  can  be 
applied  using  dummy  variable  regressions.  Additionally,  they  can  be  used  in  modeling  splines, 
see  Poirier  (1976)  and  Suits,  Mason  and  Chan  (1978),  and  fixed  effects  in  panel  data,  see 
Chapter  12.  Finally,  when  the  dependent  variable  is  itself  a  dummy  variable,  the  regression 
equation  needs  special  treatment,  see  Chapter  13  on  qualitative  limited  dependent  variables. 

Empirical  Example:  Table  4.1  gives  the  results  of  a  regression  on  595  individuals  drawn  from 
the  Panel  Study  of  Income  Dynamics  (PSID)  in  1982.  This  data  is  provided  on  the  Springer 
web  site  as  EARN.ASC.  A  description  of  the  data  is  given  in  Cornwell  and  Rupert  (1988).  In 
particular,  log  wage  is  regressed  on  years  of  education  (ED),  weeks  worked  (WKS),  years  of 
full-time  work  experience  (EXP),  occupation  (OCC  =  1,  if  the  individual  is  in  a  blue-collar 
occupation),  residence  (SOUTH  =  1,  SMSA  =  1,  if  the  individual  resides  in  the  South,  or 
in  a  standard  metropolitan  statistical  area),  industry  (IND  =  1,  if  the  individual  works  in  a 
manufacturing  industry),  marital  status  (MS  =  1,  if  the  individual  is  married),  sex  and  race 
(FEM  =  1,  BLK  =  1,  if  the  individual  is  female  or  black),  union  coverage  (UNION  =  1,  if  the 
individual’s  wage  is  set  by  a  union  contract).  These  results  show  that  the  returns  to  an  extra  year 
of  schooling  is  5.7%,  holding  everything  else  constant.  It  shows  that  Males  on  the  average  earn 
more  than  Females.  Blacks  on  the  average  earn  less  than  Whites,  and  Union  workers  earn  more 
than  non-union  workers.  Individuals  residing  in  the  South  earn  less  than  those  living  elsewhere. 
Those  residing  in  a  standard  metropolitan  statistical  area  earn  more  on  the  average  than  those 

Table 4.1  Earnings Regression for 1982

Dependent Variable: LWAGE

Analysis of Variance

Source      DF    Sum of Squares    Mean Square    F Value    Prob > F
Model       12          52.48064        4.37339     41.263      0.0001
Error      582          61.68465        0.10599
C Total    594         114.16529

Root MSE    0.32556    R-square    0.4597
Dep Mean    6.95074    Adj R-sq    0.4485
C.V.        4.68377

Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    T for H0: Parameter=0    Prob > |T|
INTERCEP     1              5.590093        0.19011263                   29.404        0.0001
WKS          1              0.003413        0.00267762                    1.275        0.2030
SOUTH        1             -0.058763        0.03090689                   -1.901        0.0578
SMSA         1              0.166191        0.02955099                    5.624        0.0001
MS           1              0.095237        0.04892770                    1.946        0.0521
EXP          1              0.029380        0.00652410                    4.503        0.0001
EXP2         1             -0.000486        0.00012680                   -3.833        0.0001
OCC          1             -0.161522        0.03690729                   -4.376        0.0001
IND          1              0.084663        0.02916370                    2.903        0.0038
UNION        1              0.106278        0.03167547                    3.355        0.0008
FEM          1             -0.324557        0.06072947                   -5.344        0.0001
BLK          1             -0.190422        0.05441180                   -3.500        0.0005
ED           1              0.057194        0.00659101                    8.678        0.0001

who do not. Individuals who work in a manufacturing industry, or are not blue-collar workers,
or are married earn more on average than those who are not. With EXP2 = (EXP)^2, this
regression indicates a significant quadratic relationship between earnings and experience. All
the variables were significant at the 5% level except for WKS, SOUTH and MS.
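
The summary statistics in Table 4.1 can be checked directly from the sums of squares it reports. The short Python sketch below is only an illustration (the variable names are ours, not part of the original output): it recomputes the F-statistic, R-squared, adjusted R-squared and root MSE from the analysis-of-variance panel, and, as an additional standard calculation not reported in the text, the level of experience at which the fitted quadratic in EXP reaches its maximum.

```python
# Figures copied from the analysis-of-variance panel of Table 4.1.
ss_model, df_model = 52.48064, 12
ss_error, df_error = 61.68465, 582
ss_total, df_total = 114.16529, 594

# F-statistic for the joint significance of the 12 slope coefficients.
f_stat = (ss_model / df_model) / (ss_error / df_error)

# R-squared and adjusted R-squared.
r2 = ss_model / ss_total
adj_r2 = 1 - (ss_error / df_error) / (ss_total / df_total)

# Root MSE is the square root of the error mean square.
root_mse = (ss_error / df_error) ** 0.5

# Experience at which the fitted quadratic peaks: -b_EXP / (2 * b_EXP2).
b_exp, b_exp2 = 0.029380, -0.000486
exp_peak = -b_exp / (2 * b_exp2)

print(f_stat, r2, adj_r2, root_mse, exp_peak)
```

Up to rounding, the first four numbers should agree with the F Value, R-square, Adj R-sq and Root MSE entries of the table; the last one suggests that, in this fitted equation, log wage peaks at roughly 30 years of experience.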


Note 


1.  There  are  more  sophisticated  ways  of  seasonal  adjustment  than  introducing  seasonal  dummies,  see 
Judge  et  al.  (1985). 


Problems 

1. For the cigarette data given in Table 3.2, run the following regressions:

(a) Real per capita consumption of cigarettes on real price and real per capita income. (All
variables are in log form, and all regressions in this problem include a constant.)

(b) Real per capita consumption of cigarettes on real price.

(c) Real per capita income on real price.

(d) Real per capita consumption on the residuals of part (c).

(e) Residuals from part (b) on the residuals in part (c).

(f) Compare the regression slope estimates in parts (d) and (e) with the regression coefficient
estimate of the real income coefficient in part (a). What do you conclude?

2. Simple Versus Multiple Regression Coefficients. This is based on Baltagi (1987b). Consider the
multiple regression

$$Y_i = \alpha + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i \qquad i = 1, 2, \ldots, n$$

along with the following auxiliary regressions:

$$X_{2i} = \hat{a} + \hat{b} X_{3i} + \hat{\nu}_{2i}$$
$$X_{3i} = \hat{c} + \hat{d} X_{2i} + \hat{\nu}_{3i}$$

In section 4.3, we showed that $\hat{\beta}_2$, the OLS estimate of $\beta_2$, can be interpreted as a simple regression
of $Y$ on the OLS residuals $\hat{\nu}_2$. A similar interpretation can be given to $\hat{\beta}_3$. Kennedy (1981, p. 416)
claims that $\hat{\beta}_2$ is not necessarily the same as $\hat{\delta}_2$, the OLS estimate of $\delta_2$ obtained from the regression of
$Y$ on $\hat{\nu}_2$, $\hat{\nu}_3$ and a constant, $Y_i = \gamma + \delta_2 \hat{\nu}_{2i} + \delta_3 \hat{\nu}_{3i} + w_i$. Prove this claim by finding a relationship
between the $\hat{\beta}$'s and the $\hat{\delta}$'s.

3. For the simple regression $Y_i = \alpha + \beta X_i + u_i$ considered in Chapter 3, show that

(a) $\hat{\beta}_{OLS} = \sum_{i=1}^n x_i y_i / \sum_{i=1}^n x_i^2$ can be obtained using the residual interpretation by regressing
$X$ on a constant first, getting the residuals $\hat{\nu}$ and then regressing $Y$ on $\hat{\nu}$.

(b) $\hat{\alpha}_{OLS} = \bar{Y} - \hat{\beta}_{OLS}\bar{X}$ can be obtained using the residual interpretation by regressing 1 on $X$
and obtaining the residuals $\hat{\omega}$ and then regressing $Y$ on $\hat{\omega}$.

(c) Check the var($\hat{\alpha}_{OLS}$) and var($\hat{\beta}_{OLS}$) in parts (a) and (b) with those obtained from the
residualing interpretation.

4. Effect of Additional Regressors on $R^2$. This is based on Nieswiadomy (1986).

(a) Suppose that the multiple regression given in (4.1) has $K_1$ regressors in it. Denote the least
squares sum of squared errors by $SSE_1$. Now add $K_2$ regressors so that the total number of
regressors is $K = K_1 + K_2$. Denote the corresponding least squares sum of squared errors
by $SSE_2$. Show that $SSE_2 \le SSE_1$, and conclude that the corresponding R-squares satisfy
$R_2^2 \ge R_1^2$.

(b) Derive the equality given in (4.16) starting from the definition of $R^2$ and $\bar{R}^2$.

(c) Show that the corresponding $\bar{R}$-squares satisfy $\bar{R}_1^2 \ge \bar{R}_2^2$ when the F-statistic for the joint
significance of these additional $K_2$ regressors is less than or equal to one.

5. Perfect Multicollinearity. Let $Y$ be the output and $X_2$ = skilled labor and $X_3$ = unskilled labor
in the following relationship:

$$Y_i = \alpha + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 (X_{2i} + X_{3i}) + \beta_5 X_{2i}^2 + \beta_6 X_{3i}^2 + u_i$$

What parameters are estimable by OLS?

6. Suppose that we have estimated the parameters of the multiple regression model:

$$Y_t = \beta_1 + \beta_2 X_{t2} + \beta_3 X_{t3} + u_t$$

by the Ordinary Least Squares (OLS) method. Denote the estimated residuals by $(e_t, t = 1, \ldots, T)$
and the predicted values by $(\hat{Y}_t, t = 1, \ldots, T)$.

(a) What is the $R^2$ of the regression of $e$ on a constant, $X_2$ and $X_3$?

(b) If we regress $Y$ on a constant and $\hat{Y}$, what are the estimated intercept and slope coefficients?
What is the relationship between the $R^2$ of this regression and the $R^2$ of the original
regression?

(c) If we regress $Y$ on a constant and $e$, what are the estimated intercept and slope coefficients?
What is the relationship between the $R^2$ of this regression and the $R^2$ of the original
regression?

(d) Suppose that we add a new explanatory variable $X_4$ to the original model and re-estimate
the parameters by OLS. Show that the estimated coefficient of $X_4$ and its estimated standard
error will be the same as in the OLS regression of $e$ on a constant, $X_2$, $X_3$ and $X_4$.

7. Consider the Cobb-Douglas production function in example 5. How can you test for constant
returns to scale using a t-statistic from the unrestricted regression given in (4.18)?

8. Testing Multiple Restrictions. For the multiple regression given in (4.1), set up the F-statistic
described in (4.17) for testing

(a) $H_0$: $\beta_2 = \beta_4 = \beta_6$.

(b) $H_0$: $\beta_2 = -\beta_3$ and $\beta_5 - \beta_6 = 1$.

9. Monte Carlo Experiments. Hanushek and Jackson (1977, pp. 60-65) generated the following data
$Y_i = 15 + 1X_{2i} + 2X_{3i} + u_i$ for $i = 1, 2, \ldots, 25$ with a fixed set of $X_{2i}$ and $X_{3i}$, and $u_i$'s that
are IID $\sim N(0, 100)$. For each set of 25 $u_i$'s drawn randomly from the normal distribution, a
corresponding set of 25 $Y_i$'s are created from the above equation. Then OLS is performed on the
resulting data set. This can be repeated as many times as we can afford. 400 replications were
performed by Hanushek and Jackson. This means that they generated 400 data sets each of size 25
and ran 400 regressions giving 400 OLS estimates of $\alpha$, $\beta_2$, $\beta_3$ and $\sigma^2$. The classical assumptions
are satisfied for this model, by construction, so we expect these OLS estimators to be BLUE, MLE
and efficient.

(a) Replicate the Monte Carlo experiments of Hanushek and Jackson (1977) and generate the
means of the 400 estimates of the regression coefficients as well as $\sigma^2$. Are these estimates
unbiased?

(b) Compute the standard deviation of these 400 estimates and call this $\hat{\sigma}_b$. Also compute the
average of the 400 standard errors of the regression estimates reported by the regression.
Denote this mean by $\bar{s}_b$. Compare these two estimates of the standard deviation of the
regression coefficient estimates to the true standard deviation knowing the true $\sigma^2$. What do
you conclude?

(c) Plot the frequency of these regression coefficient estimates. Does it resemble its theoretical
distribution?

(d) Increase the sample size from 25 to 50 and repeat the experiment. What do you observe?

10. Female and Male Dummy Variables.

(a) Derive the OLS estimates of $\alpha_F$ and $\alpha_M$ for $Y_i = \alpha_F F_i + \alpha_M M_i + u_i$ where $Y$ is Earnings,
$F$ is FEMALE and $M$ is MALE, see (4.21). Show that $\hat{\alpha}_F = \bar{Y}_F$, the average of the $Y_i$'s only
for females, and $\hat{\alpha}_M = \bar{Y}_M$, the average of the $Y_i$'s only for males.

(b) Suppose that the regression is $Y_i = \alpha + \beta F_i + u_i$, see (4.22). Show that $\hat{\alpha} = \hat{\alpha}_M$, and
$\hat{\beta} = \hat{\alpha}_F - \hat{\alpha}_M$.

(c) Substitute $M = 1 - F$ in (4.21) and show that $\alpha = \alpha_M$ and $\beta = \alpha_F - \alpha_M$.

(d) Verify parts (a), (b) and (c) using the earnings data underlying Table 4.1.

11. Multiple Dummy Variables. For equation (4.23)

$$EARN = \alpha + \beta_F FEMALE + \beta_B BLACK + \beta_H HISPANIC + u$$

Show that

(a) E(Earnings/Hispanic Female) = $\alpha + \beta_F + \beta_H$; also E(Earnings/Hispanic Male) = $\alpha + \beta_H$.
Conclude that $\beta_F$ = E(Earnings/Hispanic Female) - E(Earnings/Hispanic Male).

(b) E(Earnings/Hispanic Female) - E(Earnings/White Female) = E(Earnings/Hispanic Male) -
E(Earnings/White Male) = $\beta_H$.

(c) E(Earnings/Black Female) - E(Earnings/White Female) = E(Earnings/Black Male) -
E(Earnings/White Male) = $\beta_B$.

12. For the earnings equation given in (4.28), how would you set up the F-test and what are the
restricted and unrestricted regressions for testing the following hypotheses:

(a) The equality of slopes and intercepts for Males and Females.

(b) The equality of intercepts given the same slopes for Males and Females. Show that the
resulting F-statistic is the square of a t-statistic from the unrestricted regression.

(c) The equality of intercepts allowing for different slopes for Males and Females. Show that the
resulting F-statistic is the square of a t-statistic from the unrestricted regression.

(d)  Apply  your  results  in  parts  (a),  (b)  and  (c)  to  the  earnings  data  underlying  Table  4.1. 

13.  For  the  earnings  data  regression  underlying  Table  4.1. 

(a)  Replicate  the  regression  results  given  in  Table  4.1. 

(b)  Verify  that  the  joint  significance  of  all  slope  coefficients  can  be  obtained  from  (4.20). 

(c)  How  would  you  test  the  joint  restriction  that  expected  earnings  are  the  same  for  Males  and 
Females  whether  Black  or  Non-Black  holding  everything  else  constant? 

(d)  How  would  you  test  the  joint  restriction  that  expected  earnings  are  the  same  whether  the 
individual  is  married  or  not  and  whether  this  individual  belongs  to  a  Union  or  not? 

(e)  From  Table  4.1  what  is  your  estimate  of  the  %  change  in  earnings  due  to  Union  membership? 
If  the  disturbances  are  assumed  to  be  log-normal,  what  would  be  the  estimate  suggested  by 
Kennedy  (1981)  for  this  %  change  in  earnings? 

(f)  What  is  your  estimate  of  the  %  change  in  earnings  due  to  the  individual  being  married? 

14. Crude Quality. Use the data set of U.S. oil field postings on crude prices ($/barrel), gravity
(degree API) and sulphur (% sulphur) given in the CRUDES.ASC file on the Springer web site.

(a) Estimate the following multiple regression model: $POIL = \beta_1 + \beta_2 GRAVITY + \beta_3 SULPHUR + \epsilon$.

(b) Regress $GRAVITY = \alpha_0 + \alpha_1 SULPHUR + \nu_t$, then compute the residuals ($\hat{\nu}_t$). Now perform
the regression

$$POIL = \gamma_1 + \gamma_2 \hat{\nu}_t + \epsilon$$

Verify that $\hat{\gamma}_2$ is the same as $\hat{\beta}_2$ in part (a). What does this tell you?

(c) Regress $POIL = \phi_1 + \phi_2 SULPHUR + w$. Compute the residuals ($\hat{w}$). Now regress $\hat{w}$ on $\hat{\nu}$
obtained from part (b), to get $\hat{w}_t = \hat{\delta}_1 + \hat{\delta}_2 \hat{\nu}_t$ + residuals. Show that $\hat{\delta}_2 = \hat{\beta}_2$ in part (a).
Again, what does this tell you?

(d) To illustrate how additional data affects multicollinearity, show how your regression in part
(a) changes when the sample is restricted to the first 25 crudes.

(e) Delete all crudes with sulphur content outside the range of 1 to 2 percent and run the multiple
regression in part (a). Discuss and interpret these results.
Table 4.2  U.S. Gasoline Data: 1950-1987

Year        CAR        QMG    PMG     POP    RGNP   PGNP
1950   49195212   40617285  0.272  152271  1090.4   26.1
1951   51948796   43896887  0.276  154878  1179.2   27.9
1952   53301329   46428148  0.287  157553  1226.1   28.3
1953   56313281   49374047  0.290  160184  1282.1   28.5
1954   58622547   51107135  0.291  163026  1252.1   29.0
1955   62688792   54333255  0.299  165931  1356.7   29.3
1956   65153810   56022406  0.310  168903  1383.5   30.3
1957   67124904   57415622  0.304  171984  1410.2   31.4
1958   68296594   59154330  0.305  174882  1384.7   32.1
1959   71354420   61596548  0.311  177830  1481.0   32.6
1960   73868682   62811854  0.308  180671  1517.2   33.2
1961   75958215   63978489  0.306  183691  1547.9   33.6
1962   79173329   62531373  0.304  186538  1647.9   34.0
1963   82713717   64779104  0.304  189242  1711.6   34.5
1964   86301207   67663848  0.312  191889  1806.9   35.0
1965   90360721   70337126  0.321  194303  1918.5   35.7
1966   93962030   73638812  0.332  196560  2048.9   36.6
1967   96930949   76139326  0.337  198712  2100.3   37.8
1968  101039113   80772657  0.348  200706  2195.4   39.4
1969  103562018   85416084  0.357  202677  2260.7   41.2
1970  106807629   88684050  0.364  205052  2250.7   43.4
1971  111297459   92194620  0.361  207661  2332.0   45.6
1972  117051638   95348904  0.388  209896  2465.5   47.5
1973  123811741   99804600  0.524  211909  2602.8   50.2
1974  127951254  100212210  0.572  213854  2564.2   55.1
1975  130918918  102327750  0.595  215973  2530.9   60.4
1976  136333934  106972740  0.631  218035  2680.5   63.5
1977  141523197  110023410  0.657  220239  2822.4   67.3
1978  146484336  113625960  0.678  222585  3115.2   72.2
1979  149422205  107831220  0.857  225055  3192.4   78.6
1980  153357876  100856070  1.191  227757  3187.1   85.7
1981  155907473  100994040  1.311  230138  3248.8   94.0
1982  156993694  100242870  1.222  232520  3166.0  100.0
1983  161017926  101515260  1.157  234799  3279.1  103.9
1984  163432944  102603690  1.129  237001  3489.9  107.9
1985  168743817  104719230  1.115  239279  3585.2  111.5
1986  173255850  107831220  0.857  241613  3676.5  114.5
1987  177922000  110467980  0.897  243915  3847.0  117.7

CAR:   Stock of Cars
QMG:   Motor Gasoline Consumption (1,000 Gallons)
PMG:   Retail Price of Motor Gasoline ($)
POP:   Population (1,000)
RGNP:  Real GNP in 1982 dollars (Billion)
PGNP:  GNP Deflator (1982=100)

15. Consider the U.S. gasoline data from 1950-1987 given in Table 4.2, and obtained from the file
USGAS.ASC on the Springer web site.

(a) For the period 1950-1972 estimate models (1) and (2):

$$\log QMG = \beta_1 + \beta_2 \log CAR + \beta_3 \log POP + \beta_4 \log RGNP + \beta_5 \log PGNP + \beta_6 \log PMG + u \qquad (1)$$

$$\log \frac{QMG}{CAR} = \gamma_1 + \gamma_2 \log \frac{RGNP}{POP} + \gamma_3 \log \frac{CAR}{POP} + \gamma_4 \log \frac{PMG}{PGNP} + \nu \qquad (2)$$

(b) What restrictions should the $\beta$'s satisfy in model (1) in order to yield the $\gamma$'s in model (2)?

(c) Compare the estimates and the corresponding standard errors from models (1) and (2).

(d) Compute the simple correlations among the $X$'s in model (1). What do you observe?

(e) Use the Chow-F test to test the parametric restrictions obtained in part (b).

(f) Estimate equations (1) and (2) now using the full data set 1950-1987. Discuss briefly the
effects on individual parameter estimates and their standard errors of the larger data set.

(g) Using a dummy variable, test the hypothesis that gasoline demand per CAR permanently
shifted downward for model (2) following the Arab Oil Embargo in 1973.

(h) Construct a dummy variable regression that will test whether the price elasticity has changed
after 1973.

16. Consider the following model for the demand for natural gas by the residential sector, call it model
(1):

$$\log Cons_{it} = \beta_0 + \beta_1 \log Pg_{it} + \beta_2 \log Po_{it} + \beta_3 \log Pe_{it} + \beta_4 \log HDD_{it} + \beta_5 \log PI_{it} + u_{it}$$

where $i = 1, 2, \ldots, 6$ states and $t = 1, 2, \ldots, 23$ years. $Cons$ is the consumption of natural gas by the
residential sector, $Pg$, $Po$ and $Pe$ are the prices of natural gas, distillate fuel oil, and electricity
of the residential sector. $HDD$ is heating degree days and $PI$ is real per capita personal income.
The data covers 6 states: NY, FL, MI, TX, UT and CA over the period 1967-1989. It is given in
the NATURAL.ASC file on the Springer web site.

(a)  Estimate  the  above  model  by  OLS.  Call  this  model  (1).  What  do  the  parameter  estimates 
imply  about  the  relationship  between  the  fuels? 

(b)  Plot  actual  consumption  versus  the  predicted  values.  What  do  you  observe? 

(c)  Add  a  dummy  variable  for  each  state  except  California  and  run  OLS.  Call  this  model  (2). 
Compute  the  parameter  estimates  and  standard  errors  and  compare  to  model  (1).  Do  any 
of  the  interpretations  of  the  price  coefficients  change?  What  is  the  interpretation  of  the  New 
York  dummy  variable?  What  is  the  predicted  consumption  of  natural  gas  for  New  York  in 
1989? 

(d)  Test  the  hypothesis  that  the  intercepts  of  New  York  and  California  are  the  same. 

(e)  Test  the  hypothesis  that  all  the  states  have  the  same  intercept. 

(f)  Add  a  dummy  variable  for  each  state  and  run  OLS  without  an  intercept.  Call  this  model 
(3).  Compare  the  parameter  estimates  and  standard  errors  to  the  first  two  models.  What  is 
the  interpretation  of  the  coefficient  of  the  New  York  dummy  variable?  What  is  the  predicted 
consumption  of  natural  gas  for  New  York  in  1989? 

(g)  Using  the  regression  in  part  (f),  test  the  hypothesis  that  the  intercepts  of  New  York  and 
California  are  the  same. 
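Appendix

Residual Interpretation of Multiple Regression Estimates

Proof of Claim 1: Regressing $X_2$ on all the other $X$'s yields residuals $\hat{\nu}_2$ that satisfy the usual properties
of OLS residuals similar to those in (4.2), i.e.,

$$\sum_{i=1}^n \hat{\nu}_{2i} = 0, \quad \sum_{i=1}^n \hat{\nu}_{2i} X_{3i} = \sum_{i=1}^n \hat{\nu}_{2i} X_{4i} = \ldots = \sum_{i=1}^n \hat{\nu}_{2i} X_{Ki} = 0 \qquad (A.1)$$

Note that $X_2$ is the dependent variable of this regression, and $\hat{X}_2$ is the predicted value from this
regression. The latter satisfies $\sum_{i=1}^n \hat{\nu}_{2i} \hat{X}_{2i} = 0$. This holds because $\hat{X}_2$ is a linear combination of the
other $X$'s, all of which satisfy (A.1). Turn now to the estimated regression equation:

$$Y_i = \hat{\alpha} + \hat{\beta}_2 X_{2i} + \ldots + \hat{\beta}_K X_{Ki} + e_i \qquad (A.2)$$

Multiply (A.2) by $X_{2i}$ and sum

$$\sum_{i=1}^n X_{2i} Y_i = \hat{\alpha} \sum_{i=1}^n X_{2i} + \hat{\beta}_2 \sum_{i=1}^n X_{2i}^2 + \ldots + \hat{\beta}_K \sum_{i=1}^n X_{2i} X_{Ki} \qquad (A.3)$$

This uses the fact that $\sum_{i=1}^n X_{2i} e_i = 0$. Alternatively, (A.3) is just the second equation from (4.3).
Substituting $X_{2i} = \hat{X}_{2i} + \hat{\nu}_{2i}$ in (A.3) one gets

$$\sum_{i=1}^n \hat{X}_{2i} Y_i + \sum_{i=1}^n \hat{\nu}_{2i} Y_i = \hat{\alpha} \sum_{i=1}^n \hat{X}_{2i} + \hat{\beta}_2 \sum_{i=1}^n \hat{X}_{2i}^2 + \ldots + \hat{\beta}_K \sum_{i=1}^n \hat{X}_{2i} X_{Ki} + \hat{\beta}_2 \sum_{i=1}^n \hat{\nu}_{2i}^2 \qquad (A.4)$$

using (A.1) and the fact that $\sum_{i=1}^n \hat{X}_{2i} \hat{\nu}_{2i} = 0$. Multiply (A.2) by $\hat{X}_{2i}$ and sum, we get

$$\sum_{i=1}^n \hat{X}_{2i} Y_i = \hat{\alpha} \sum_{i=1}^n \hat{X}_{2i} + \hat{\beta}_2 \sum_{i=1}^n \hat{X}_{2i} X_{2i} + \ldots + \hat{\beta}_K \sum_{i=1}^n \hat{X}_{2i} X_{Ki} + \sum_{i=1}^n \hat{X}_{2i} e_i \qquad (A.5)$$

But $\sum_{i=1}^n \hat{X}_{2i} e_i = 0$ since $\hat{X}_2$ is a linear combination of all the other $X$'s, all of which satisfy (4.2). Also,
$\sum_{i=1}^n \hat{X}_{2i} X_{2i} = \sum_{i=1}^n \hat{X}_{2i}^2$ since $\sum_{i=1}^n \hat{X}_{2i} \hat{\nu}_{2i} = 0$. Hence (A.5) reduces to

$$\sum_{i=1}^n \hat{X}_{2i} Y_i = \hat{\alpha} \sum_{i=1}^n \hat{X}_{2i} + \hat{\beta}_2 \sum_{i=1}^n \hat{X}_{2i}^2 + \ldots + \hat{\beta}_K \sum_{i=1}^n \hat{X}_{2i} X_{Ki} \qquad (A.6)$$

Subtracting (A.6) from (A.4), we get

$$\sum_{i=1}^n \hat{\nu}_{2i} Y_i = \hat{\beta}_2 \sum_{i=1}^n \hat{\nu}_{2i}^2 \qquad (A.7)$$

and $\hat{\beta}_2$ is the slope estimate of the simple regression of $Y$ on $\hat{\nu}_2$ as given in (4.5).
By substituting for $Y_i$ its expression from equation (4.1) in (4.5) we get

$$\hat{\beta}_2 = \beta_2 \sum_{i=1}^n X_{2i} \hat{\nu}_{2i} / \sum_{i=1}^n \hat{\nu}_{2i}^2 + \sum_{i=1}^n \hat{\nu}_{2i} u_i / \sum_{i=1}^n \hat{\nu}_{2i}^2 \qquad (A.8)$$

where $\sum_{i=1}^n X_{ki} \hat{\nu}_{2i} = 0$ for $k \ne 2$ and $\sum_{i=1}^n \hat{\nu}_{2i} = 0$. But, $X_{2i} = \hat{X}_{2i} + \hat{\nu}_{2i}$ and $\sum_{i=1}^n \hat{X}_{2i} \hat{\nu}_{2i} = 0$, which implies
that $\sum_{i=1}^n X_{2i} \hat{\nu}_{2i} = \sum_{i=1}^n \hat{\nu}_{2i}^2$ and $\hat{\beta}_2 = \beta_2 + \sum_{i=1}^n \hat{\nu}_{2i} u_i / \sum_{i=1}^n \hat{\nu}_{2i}^2$. This means that $\hat{\beta}_2$ is unbiased with
$E(\hat{\beta}_2) = \beta_2$ since $\hat{\nu}_2$ is a linear combination of the $X$'s and these in turn are not correlated with the $u$'s.
Also,

$$var(\hat{\beta}_2) = E(\hat{\beta}_2 - \beta_2)^2 = E\left(\sum_{i=1}^n \hat{\nu}_{2i} u_i / \sum_{i=1}^n \hat{\nu}_{2i}^2\right)^2 = \sigma^2 / \sum_{i=1}^n \hat{\nu}_{2i}^2$$

The same results apply for any $\hat{\beta}_k$ for $k = 2, \ldots, K$, i.e.,

$$\hat{\beta}_k = \sum_{i=1}^n \hat{\nu}_{ki} Y_i / \sum_{i=1}^n \hat{\nu}_{ki}^2 \qquad (A.9)$$

where $\hat{\nu}_k$ is the OLS residual of $X_k$ on all the other $X$'s in the regression. Similarly,

$$\hat{\beta}_k = \beta_k + \sum_{i=1}^n \hat{\nu}_{ki} u_i / \sum_{i=1}^n \hat{\nu}_{ki}^2 \qquad (A.10)$$

and $E(\hat{\beta}_k) = \beta_k$ with $var(\hat{\beta}_k) = \sigma^2 / \sum_{i=1}^n \hat{\nu}_{ki}^2$ for $k = 2, \ldots, K$. Note also that

$$cov(\hat{\beta}_2, \hat{\beta}_k) = E(\hat{\beta}_2 - \beta_2)(\hat{\beta}_k - \beta_k) = E\left(\sum_{i=1}^n \hat{\nu}_{2i} u_i / \sum_{i=1}^n \hat{\nu}_{2i}^2\right)\left(\sum_{i=1}^n \hat{\nu}_{ki} u_i / \sum_{i=1}^n \hat{\nu}_{ki}^2\right) = \sigma^2 \sum_{i=1}^n \hat{\nu}_{2i} \hat{\nu}_{ki} / \left(\sum_{i=1}^n \hat{\nu}_{2i}^2 \sum_{i=1}^n \hat{\nu}_{ki}^2\right)$$

Proof of Claim 2: Regressing $Y$ on all the other $X$'s yields $Y_i = \hat{Y}_i + \hat{\omega}_i$. Substituting this expression
for $Y_i$ in (4.5) one gets

$$\hat{\beta}_2 = \left(\sum_{i=1}^n \hat{\nu}_{2i} \hat{Y}_i + \sum_{i=1}^n \hat{\nu}_{2i} \hat{\omega}_i\right) / \sum_{i=1}^n \hat{\nu}_{2i}^2 = \sum_{i=1}^n \hat{\nu}_{2i} \hat{\omega}_i / \sum_{i=1}^n \hat{\nu}_{2i}^2 \qquad (A.11)$$

where the last equality follows from the fact that $\hat{Y}$ is a linear combination of all $X$'s excluding $X_2$, all
of which satisfy (A.1). Hence $\hat{\beta}_2$ is the estimate of the slope coefficient in the linear regression of $\hat{\omega}$ on
$\hat{\nu}_2$.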
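
Both claims can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the eight observations are made up for the purpose (they are not from the book), the helper names are ours, and only two regressors are used so that the multiple regression slopes can be computed from the normal equations in deviation form without a matrix library.

```python
def mean(v):
    return sum(v) / len(v)

def simple_ols(y, x):
    """Return (intercept, slope, residuals) from regressing y on a constant and x."""
    xb, yb = mean(x), mean(y)
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = yb - slope * xb
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return intercept, slope, resid

def multiple_ols_slopes(y, x2, x3):
    """Slopes of y on a constant, x2 and x3 (normal equations in deviations)."""
    yb, x2b, x3b = mean(y), mean(x2), mean(x3)
    s22 = sum((a - x2b) ** 2 for a in x2)
    s33 = sum((b - x3b) ** 2 for b in x3)
    s23 = sum((a - x2b) * (b - x3b) for a, b in zip(x2, x3))
    s2y = sum((a - x2b) * (c - yb) for a, c in zip(x2, y))
    s3y = sum((b - x3b) * (c - yb) for b, c in zip(x3, y))
    det = s22 * s33 - s23 ** 2
    return (s2y * s33 - s3y * s23) / det, (s3y * s22 - s2y * s23) / det

# Hypothetical illustrative data (not from the book).
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
y = [3.5, 2.8, 7.9, 6.1, 12.0, 10.2, 14.1, 16.5]

b2, b3 = multiple_ols_slopes(y, x2, x3)

# Claim 1: regress X2 on a constant and X3, keep residuals v2, then regress Y on v2.
_, _, v2 = simple_ols(x2, x3)
b2_claim1 = sum(v * yi for v, yi in zip(v2, y)) / sum(v * v for v in v2)

# Claim 2: also residualize Y on a constant and X3 (residuals w), then regress w on v2.
_, _, w = simple_ols(y, x3)
b2_claim2 = sum(v * wi for v, wi in zip(v2, w)) / sum(v * v for v in v2)

print(b2, b2_claim1, b2_claim2)
```

The three printed numbers should coincide up to floating-point rounding, which is what (A.7) and (A.11) assert for this two-regressor case.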


Simple,  Partial  and  Multiple  Correlation  Coefficients 

In Chapter 3, we interpreted the square of the simple correlation coefficient, $r_{Y,X_2}^2$, as the proportion of
the variation in $Y$ that is explained by $X_2$. Similarly, $r_{Y,X_k}^2$ is the R-squared of the simple regression
of $Y$ on $X_k$ for $k = 2, \ldots, K$. In fact, one can compute these simple correlation coefficients and find
out which $X_k$ is most correlated with $Y$, say it is $X_2$. If one is selecting regressors to include in the
regression equation, $X_2$ would be the best one-variable candidate. In order to determine what variable
to include next, we look at partial correlation coefficients of the form $r_{Y,X_k.X_2}$ for $k \ne 2$. The square of
this first-order partial gives the proportion of the residual variation in $Y$, not explained by $X_2$, that is
explained by the addition of $X_k$. The maximum first-order partial ('first' because it has only one variable
after the dot) determines the best candidate to follow $X_2$. Let us assume it is $X_3$. The first-order partial
correlation coefficients can be computed from simple correlation coefficients as follows:

$$r_{Y,X_3.X_2} = \frac{r_{Y,X_3} - r_{Y,X_2} r_{X_2,X_3}}{\sqrt{(1 - r_{Y,X_2}^2)(1 - r_{X_2,X_3}^2)}}$$

see Johnston (1984). Next we look at second-order partials of the form $r_{Y,X_k.X_2,X_3}$ for $k \ne 2, 3$, and so
on. This method of selecting regressors is called forward selection. Suppose there is only $X_2$, $X_3$ and $X_4$
in the regression equation. In this case $(1 - r_{Y,X_2}^2)$ is the proportion of the variation in $Y$, i.e., $\sum_{i=1}^n y_i^2$,
that is not explained by $X_2$. Also $(1 - r_{Y,X_3.X_2}^2)(1 - r_{Y,X_2}^2)$ denotes the proportion of the variation in $Y$
not explained after the inclusion of both $X_2$ and $X_3$. Similarly $(1 - r_{Y,X_4.X_2,X_3}^2)(1 - r_{Y,X_3.X_2}^2)(1 - r_{Y,X_2}^2)$
is the proportion of the variation in $Y$ unexplained after the inclusion of $X_2$, $X_3$ and $X_4$. But this is
exactly $(1 - R^2)$, where $R^2$ denotes the R-squared of the multiple regression of $Y$ on a constant, $X_2$, $X_3$
and $X_4$. This $R^2$ is called the multiple correlation coefficient, and is also written as $R_{Y.X_2,X_3,X_4}^2$. Hence

$$(1 - R_{Y.X_2,X_3,X_4}^2) = (1 - r_{Y,X_2}^2)(1 - r_{Y,X_3.X_2}^2)(1 - r_{Y,X_4.X_2,X_3}^2)$$

and similar expressions relating the multiple correlation coefficient to simple and partial correlation
coefficients can be written by including say $X_3$ first then $X_4$ and $X_2$ in that order.
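
As a numerical illustration of the first-order partial correlation formula and of the product decomposition of $(1 - R^2)$, the Python sketch below works through the two-regressor case, where the decomposition reads $(1 - R^2) = (1 - r_{Y,X_2}^2)(1 - r_{Y,X_3.X_2}^2)$. The data are hypothetical and the function names are ours; this is a check under those assumptions, not a general forward-selection routine.

```python
from math import sqrt

def dev_sums(a, b):
    """Sum of cross-products of deviations from the means of a and b."""
    ab, bb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ab) * (q - bb) for p, q in zip(a, b))

def corr(a, b):
    """Simple correlation coefficient between a and b."""
    return dev_sums(a, b) / sqrt(dev_sums(a, a) * dev_sums(b, b))

def first_order_partial(r_y3, r_y2, r_23):
    """r_{Y,X3.X2} computed from the three simple correlations."""
    return (r_y3 - r_y2 * r_23) / sqrt((1 - r_y2 ** 2) * (1 - r_23 ** 2))

def r_squared(y, x2, x3):
    """R-squared of y on a constant, x2 and x3, computed from the OLS fit itself."""
    s22, s33, s23 = dev_sums(x2, x2), dev_sums(x3, x3), dev_sums(x2, x3)
    s2y, s3y, syy = dev_sums(x2, y), dev_sums(x3, y), dev_sums(y, y)
    det = s22 * s33 - s23 ** 2
    b2 = (s2y * s33 - s3y * s23) / det
    b3 = (s3y * s22 - s2y * s23) / det
    return (b2 * s2y + b3 * s3y) / syy  # explained over total sum of squares

# Hypothetical illustrative data (not from the book).
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
y = [3.5, 2.8, 7.9, 6.1, 12.0, 10.2, 14.1, 16.5]

r_y2, r_y3, r_23 = corr(y, x2), corr(y, x3), corr(x2, x3)
r_y3_given_2 = first_order_partial(r_y3, r_y2, r_23)

lhs = 1 - r_squared(y, x2, x3)
rhs = (1 - r_y2 ** 2) * (1 - r_y3_given_2 ** 2)
print(lhs, rhs)
```

The two printed values should agree up to rounding. Entering $X_3$ first and $X_2$ second gives the same product, in line with the remark that the order of inclusion can be changed.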


CHAPTER  5 

Violations  of  the  Classical  Assumptions 

5.1  Introduction 

In  this  chapter,  we  relax  the  assumptions  made  in  Chapter  3  one  by  one  and  study  the  effect 
of  that  on  the  OLS  estimator.  In  case  the  OLS  estimator  is  no  longer  a  viable  estimator,  we 
derive  an  alternative  estimator  and  propose  some  tests  that  will  allow  us  to  check  whether  this 
assumption  is  violated. 


5.2  The  Zero  Mean  Assumption 

Violation  of  assumption  1  implies  that  the  mean  of  the  disturbances  is  no  longer  zero.  Two 
cases  are  considered: 


Case 1: $E(u_i) = \mu \ne 0$

The disturbances have a common mean which is not zero. In this case, one can subtract $\mu$ from
the $u_i$'s and get new disturbances $u_i^* = u_i - \mu$ which have zero mean and satisfy all the other
assumptions imposed on the $u_i$'s. Having subtracted $\mu$ from $u_i$ we add it to the constant $\alpha$
leaving the regression equation intact:

$$Y_i = \alpha^* + \beta X_i + u_i^* \qquad i = 1, 2, \ldots, n \qquad (5.1)$$

where $\alpha^* = \alpha + \mu$. It is clear that only $\alpha^*$ and $\beta$ can be estimated, and not $\alpha$ nor $\mu$. In other
words, one cannot retrieve $\alpha$ and $\mu$ from an estimate of $\alpha^*$ without additional assumptions or
further information, see problem 10. With this reparameterization, equation (5.1) satisfies the
four classical assumptions, and therefore OLS gives the BLUE estimators of $\alpha^*$ and $\beta$. Hence,
a constant non-zero mean for the disturbances affects only the intercept estimate but not the
slope. Fortunately, in most economic applications, it is the slope coefficients that are of interest
and not the intercept.
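
A small simulation can make this point concrete. The Python sketch below is an illustration only: the parameter values ($\alpha = 1$, $\beta = 2$, $\mu = 3$), the sample size and the seed are arbitrary choices of ours. It fits the same regression twice, once with zero-mean disturbances and once with the common mean $\mu$ added to every disturbance.

```python
import random

def ols(y, x):
    """Return (intercept, slope) from regressing y on a constant and x."""
    xb, yb = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - xb) * (b - yb) for a, b in zip(x, y))
             / sum((a - xb) ** 2 for a in x))
    return yb - slope * xb, slope

rng = random.Random(0)            # fixed seed so the run is reproducible
n = 5000
alpha, beta, mu = 1.0, 2.0, 3.0   # assumed illustrative values

x = [rng.uniform(0, 10) for _ in range(n)]
u_star = [rng.gauss(0, 1) for _ in range(n)]   # zero-mean disturbances u*

# Same draws, but the second set of disturbances has mean mu instead of zero.
y_zero_mean = [alpha + beta * xi + e for xi, e in zip(x, u_star)]
y_shifted = [alpha + beta * xi + (mu + e) for xi, e in zip(x, u_star)]

a0, b0 = ols(y_zero_mean, x)
a1, b1 = ols(y_shifted, x)
print(a0, b0)
print(a1, b1)
```

The slope estimate is the same in both fits, while the intercept estimate moves from roughly $\alpha$ to roughly $\alpha + \mu$. Nothing in the second fit allows $\alpha$ and $\mu$ to be separated, which is the identification point made above.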


Case 2: $E(u_i) = \mu_i$

The disturbances have a mean which varies with every observation. In this case, one can transform
the regression equation as in (5.1) by adding and subtracting $\mu_i$. The problem, however,
is that $\alpha_i^* = \alpha + \mu_i$ now varies with each observation, and hence we have more parameters than
observations. In fact, there are $n$ intercepts and one slope to be estimated with $n$ observations.
Unless we have repeated observations like in panel data, see Chapter 12, or we have some prior
information on these $\alpha_i^*$, we cannot estimate this model.
5.3  Stochastic Explanatory Variables

Sections 5.5 and 5.6 will study violations of assumptions 2 and 3 in detail. This section deals with
violations of assumption 4 and its effect on the properties of the OLS estimators. In this case,
$X$ is a random variable which may be (i) independent of; (ii) contemporaneously uncorrelated with; or
(iii) simply correlated with the disturbances.

Case 1: If $X$ is independent of $u$, then all the results of Chapter 3 still hold, but now they are
conditional on the particular set of $X$'s drawn in the sample. To illustrate this result, recall
that for the simple linear regression:

$$\hat{\beta}_{OLS} = \beta + \sum_{i=1}^n w_i u_i \quad \text{where } w_i = x_i / \sum_{i=1}^n x_i^2 \qquad (5.2)$$

Hence, when we take expectations $E(\sum_{i=1}^n w_i u_i) = \sum_{i=1}^n E(w_i) E(u_i) = 0$. The first equality
holds because $X$ and $u$ are independent and the second equality holds because the $u$'s have zero
mean. In other words the unbiasedness property of the OLS estimator still holds. However, the

$$var(\hat{\beta}_{OLS}) = E\left(\sum_{i=1}^n w_i u_i\right)^2 = \sum_{i=1}^n \sum_{j=1}^n E(w_i w_j) E(u_i u_j) = \sigma^2 \sum_{i=1}^n E(w_i^2)$$

where the last equality follows from assumptions 2 and 3, homoskedasticity and no serial correlation.
The only difference between this result and that of Chapter 3 is that we have expectations
on the $X$'s rather than the $X$'s themselves. Hence, by conditioning on the particular set of $X$'s
that are observed, we can use all the results of Chapter 3. Also, maximizing the likelihood involves
both the $X$'s and the $u$'s. But, as long as the distribution of the $X$'s does not involve the
parameters we are estimating, i.e., $\alpha$, $\beta$ and $\sigma^2$, the same maximum likelihood estimators are
obtained. Why? Because $f(x_1, x_2, \ldots, x_n, u_1, u_2, \ldots, u_n) = f_1(x_1, x_2, \ldots, x_n) f_2(u_1, u_2, \ldots, u_n)$
since the $X$'s and the $u$'s are independent. Maximizing $f$ with respect to $(\alpha, \beta, \sigma^2)$ is the same
as maximizing $f_2$ with respect to $(\alpha, \beta, \sigma^2)$ as long as $f_1$ is not a function of these parameters.

Case  2:  Consider  a  simple  model  of  consumption,  where  Yj,  current  consumption,  is  a  function 
of  Yt- 1,  consumption  in  the  previous  period.  This  is  the  case  for  a  habit  forming  consumption 
good  like  cigarette  smoking.  In  this  case  our  regression  equation  becomes 

Yt  =  a  +  pYt-\  +  ut  t  =  2, . . .  ,T  (5.3) 

where  we  lost  one  observation  due  to  lagging.  It  is  obvious  that  Yt  is  correlated  to  ut,  but  the 
question  here  is  whether  Yt-i  is  correlated  to  ut-  After  all,  Yt- 1  is  our  explanatory  variable 
Xt.  As  long  as  assumption  3  is  not  violated,  i.e.,  the  ri’s  are  not  correlated  across  periods,  ut 
represents  a  freshly  drawn  disturbance  independent  of  previous  disturbances  and  hence  is  not 
correlated  with  the  already  predetermined  Yt- 1.  This  is  what  we  mean  by  contemporaneously 
uncorrelated,  i.e.,  ut  is  correlated  with  Y),  but  it  is  not  correlated  with  Y)_i .  The  OLS  estimator 
of  (3  is 

Pols  =  £f=2  VtVt- 1/  EL2  Vt- 1  =  P  +  Ef=2  Vt-iut/ Tl=2  Vt- 1  (5-4) 

and the expected value of (5.4) is not $\beta$ because, in general,

$$E\left(\sum_{t=2}^{T} y_{t-1}u_t \Big/ \sum_{t=2}^{T} y_{t-1}^2\right) \neq E\left(\sum_{t=2}^{T} y_{t-1}u_t\right) \Big/ E\left(\sum_{t=2}^{T} y_{t-1}^2\right).$$

The expected value of a ratio is not the ratio of expected values. Also, even if $E(Y_{t-1}u_t) = 0$,
one can easily show that $E(y_{t-1}u_t) \neq 0$. In fact, $y_{t-1} = Y_{t-1} - \bar{Y}$, and $\bar{Y}$ contains $Y_t$ in it, and
we know that $E(Y_t u_t) \neq 0$. Hence, we lose the unbiasedness property of OLS. However, all the
asymptotic properties still hold. In fact, $\hat{\beta}_{OLS}$ is consistent because

$$\text{plim}\, \hat{\beta}_{OLS} = \beta + \text{cov}(Y_{t-1}, u_t)/\text{var}(Y_{t-1}) = \beta \tag{5.5}$$

where the second equality follows from (5.4) and the fact that plim$(\sum_{t=2}^{T} y_{t-1}u_t/T)$ is
cov$(Y_{t-1}, u_t)$, which is zero, and plim$(\sum_{t=2}^{T} y_{t-1}^2/T)$ = var$(Y_{t-1})$, which is positive and finite.
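The contrast between small-sample bias and consistency in Case 2 can be illustrated by simulation. The following is a minimal sketch, assuming NumPy, standard normal disturbances, and illustrative parameter values ($\alpha = 1$, $\beta = 0.5$) that are not taken from the text; the function names are hypothetical.

```python
import numpy as np

def ols_lagged_slope(y):
    """OLS slope from regressing Y_t on a constant and Y_{t-1}, as in (5.4)."""
    y_lag, y_cur = y[:-1], y[1:]
    x = y_lag - y_lag.mean()
    return float(x @ (y_cur - y_cur.mean()) / (x @ x))

def simulate_mean_slope(T, beta=0.5, alpha=1.0, reps=1000, seed=0):
    """Average OLS slope over `reps` simulated samples of T observations."""
    rng = np.random.default_rng(seed)
    burn = 50                                  # discard start-up values
    slopes = np.empty(reps)
    for r in range(reps):
        u = rng.standard_normal(T + burn)      # serially uncorrelated disturbances
        y = np.empty(T + burn)
        y[0] = alpha / (1.0 - beta)            # start at the unconditional mean
        for t in range(1, T + burn):
            y[t] = alpha + beta * y[t - 1] + u[t]
        slopes[r] = ols_lagged_slope(y[burn:])
    return float(slopes.mean())

small = simulate_mean_slope(T=20)    # average estimate in short samples
large = simulate_mean_slope(T=500)   # average estimate in long samples
print(small, large)
```

Under these assumptions the average estimate in the short samples should fall noticeably below 0.5, while the long-sample average should be close to it, in line with OLS being biased but consistent.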

Case 3: $X$ and $u$ are correlated. In this case OLS is biased and inconsistent. This can be
easily deduced from (5.2), since plim$(\sum_{i=1}^{n} x_i u_i/n)$ is cov$(X, u) \neq 0$, and plim$(\sum_{i=1}^{n} x_i^2/n)$
is positive and finite. This means that OLS is no longer a viable estimator, and an alternative
estimator that corrects for this bias has to be derived. In fact, we will study three specific cases
where this assumption is violated. These are: (i) the errors in measurement case; (ii) the case
of a lagged dependent variable with correlated errors; and (iii) simultaneous equations.

Briefly, the errors in measurement case involves a situation where the true regression model
is in terms of $X^*$, but $X^*$ is measured with error, i.e., $X_i = X_i^* + \nu_i$, so we observe $X_i$ but not
$X_i^*$. Hence, when we substitute this $X_i$ for $X_i^*$ in the regression equation, we get

$$Y_i = \alpha + \beta X_i^* + u_i = \alpha + \beta X_i + (u_i - \beta\nu_i) \tag{5.6}$$

where the composite error term is now correlated with $X_i$ because $X_i$ is correlated with $\nu_i$.
After all, $X_i = X_i^* + \nu_i$ and $E(X_i\nu_i) = E(\nu_i^2)$ if $X_i^*$ and $\nu_i$ are uncorrelated.
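A small simulation can make the consequence of (5.6) visible. The sketch below assumes NumPy and mutually independent normal draws for $X_i^*$, $\nu_i$ and $u_i$, each with unit variance; these distributional choices and the parameter values are illustrative assumptions, not part of the text. Under them, cov$(X_i, u_i - \beta\nu_i) = -\beta\,\text{var}(\nu_i)$, so the OLS slope on the observed $X_i$ should settle near $\beta/2$ rather than $\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta = 100_000, 1.0, 2.0

x_star = rng.normal(0.0, 1.0, n)   # true regressor X*, variance 1
nu = rng.normal(0.0, 1.0, n)       # measurement error, variance 1
u = rng.normal(0.0, 1.0, n)        # regression disturbance
y = alpha + beta * x_star + u
x_obs = x_star + nu                # what the researcher observes

def ols_slope(x, y):
    """Simple-regression OLS slope using deviations from means."""
    xd = x - x.mean()
    return float(xd @ (y - y.mean()) / (xd @ xd))

slope_true = ols_slope(x_star, y)  # regression on the unobserved X*
slope_obs = ols_slope(x_obs, y)    # regression on the mismeasured X
print(slope_true, slope_obs)
```

The first slope should be close to 2, the second close to 1, and the gap does not shrink as `n` grows.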

Similarly, in case (ii) above, if the $u$'s were correlated across time, i.e., $u_{t-1}$ is correlated with
$u_t$, then $Y_{t-1}$, which is a function of $u_{t-1}$, will also be correlated with $u_t$, and $E(Y_{t-1}u_t) \neq 0$.
More on this, and on how to test for serial correlation in the presence of a lagged dependent variable,
in Chapter 6.

Finally, consider demand and supply equations where quantity $Q_t$ is a function of
price $P_t$ in both equations:

$$Q_t = \alpha + \beta P_t + u_t \quad \text{(demand)} \tag{5.7}$$

$$Q_t = \delta + \gamma P_t + \nu_t \quad \text{(supply)} \tag{5.8}$$

The question here is whether $P_t$ is correlated with the disturbances $u_t$ and $\nu_t$ in both equations.
The answer is yes, because (5.7) and (5.8) are two equations in two unknowns $P_t$ and $Q_t$. Solving
for these variables, one gets $P_t$ as well as $Q_t$ as a function of a constant and both $u_t$ and $\nu_t$. This
means that $E(P_t u_t) \neq 0$ and $E(P_t \nu_t) \neq 0$, and OLS performed on either (5.7) or (5.8) is biased
and inconsistent. We will study this simultaneous bias problem more rigorously in Chapter 11.
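The solution for $P_t$ can be written out explicitly. The following is a sketch of the algebra, assuming $\beta \neq \gamma$ and, for the last line only, that $u_t$ and $\nu_t$ have zero means, are uncorrelated with each other, and have variances denoted here by $\sigma_u^2$ and $\sigma_\nu^2$ (symbols introduced for this illustration):

```latex
\begin{aligned}
\alpha + \beta P_t + u_t &= \delta + \gamma P_t + \nu_t \\
P_t &= \frac{\delta - \alpha}{\beta - \gamma} + \frac{\nu_t - u_t}{\beta - \gamma} \\
E(P_t u_t) &= -\frac{\sigma_u^2}{\beta - \gamma}, \qquad
E(P_t \nu_t) = \frac{\sigma_\nu^2}{\beta - \gamma}
\end{aligned}
```

Neither expectation is zero unless the corresponding disturbance variance is zero, which is the sense in which price is correlated with the disturbance in both equations.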

For all situations where $X$ and $u$ are correlated, it would be illuminating to show graphically
why OLS is no longer a consistent estimator. Let us consider the case where the disturbances
are, say, positively correlated with the explanatory variable. Figure 3.3 of Chapter 3 shows the
true regression line $\alpha + \beta X_i$. It also shows that when $X_i$ and $u_i$ are positively correlated, an
$X_i$ higher than its mean will be associated with a disturbance $u_i$ above its mean, i.e., a positive
disturbance. Hence, $Y_i = \alpha + \beta X_i + u_i$ will always be above the true regression line whenever
$X_i$ is above its mean. Similarly, $Y_i$ would be below the true regression line for every $X_i$ below
its mean. This means that, not knowing the true regression line, a researcher fitting OLS on
this data will have a biased intercept and slope. In fact, the intercept will be understated and
the slope will be overstated. Furthermore, this bias does not disappear with more data, since
this new data will be generated by the same mechanism described above. Hence these OLS
estimates are inconsistent.

Similarly, if $X_i$ and $u_i$ are negatively correlated, the intercept will be overstated and the
slope will be understated. This story applies to any equation with at least one of its right-hand
side variables correlated with the disturbance term. Correlation due to the lagged dependent
variable with autocorrelated errors is studied in Chapter 6, whereas the correlation due to the
simultaneous equations problem is studied in Chapter 11.


5.4  Normality  of  the  Disturbances 


If the disturbances are not normal, OLS is still BLUE provided assumptions 1-4 still hold.
Normality made the OLS estimators minimum variance unbiased (MVU), and these OLS estimators
turn out to be identical to the MLE. Normality allowed the derivation of the distribution of
these estimators, and this in turn allowed testing of hypotheses using the t- and F-tests
considered in the previous chapter. If the disturbances are not normal, yet the sample size is large,
one can still use the normal distribution for the OLS estimates asymptotically by relying on the
Central Limit Theorem, see Theil (1978). Theil's proof is for the case of fixed $X$'s in repeated
samples, zero mean and constant variance on the disturbances. A simple asymptotic test for
the normality assumption is given by Jarque and Bera (1987). This is based on the fact that
the normal distribution has a skewness measure of zero and a kurtosis of 3. Skewness (or lack
of symmetry) is measured by

$$\frac{[E(X - \mu)^3]^2}{[E(X - \mu)^2]^3} = \frac{\text{Square of the 3rd moment about the mean}}{\text{Cube of the variance}}$$

Kurtosis (a measure of flatness) is measured by

$$\frac{E(X - \mu)^4}{[E(X - \mu)^2]^2} = \frac{\text{4th moment about the mean}}{\text{Square of the variance}}$$

For the normal distribution $S = 0$ and $\kappa = 3$. Hence, the Jarque-Bera (JB) statistic is given by

$$JB = n\left[\frac{S^2}{6} + \frac{(\kappa - 3)^2}{24}\right]$$

where $S$ represents skewness and $\kappa$ represents kurtosis of the OLS residuals. Here $S^2$ is the
squared measure displayed above, i.e., the square of the third moment about the mean divided by
the cube of the variance. This statistic is asymptotically distributed as $\chi^2$ with two degrees of
freedom under $H_0$. Rejecting $H_0$ rejects normality of the disturbances but does not offer an
alternative distribution. In this sense, the test is non-constructive. In addition, not rejecting
$H_0$ does not necessarily mean that the distribution of the disturbances is normal; it only means
we do not reject that the distribution of the disturbances is symmetric and has a kurtosis of 3.
See the empirical example in section 5.5 for an illustration. The Jarque-Bera test is part of the
standard output using EViews.
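For readers working outside EViews, the statistic is easy to compute directly from the residuals. The following is a minimal sketch, assuming NumPy and sample moments computed with divisor $n$; the function name is illustrative. The p-value uses the fact that the upper tail of a $\chi^2$ with two degrees of freedom is $\exp(-x/2)$.

```python
import numpy as np

def jarque_bera(resid):
    """Return (JB, p_value) for a 1-D array of regression residuals."""
    e = np.asarray(resid, dtype=float)
    n = e.size
    d = e - e.mean()
    m2 = np.mean(d ** 2)              # variance (divisor n)
    m3 = np.mean(d ** 3)              # 3rd moment about the mean
    m4 = np.mean(d ** 4)              # 4th moment about the mean
    skew_sq = m3 ** 2 / m2 ** 3       # squared skewness measure
    kurt = m4 / m2 ** 2               # kurtosis
    jb = n * (skew_sq / 6.0 + (kurt - 3.0) ** 2 / 24.0)
    p_value = float(np.exp(-jb / 2.0))  # chi-square(2) upper-tail probability
    return float(jb), p_value

jb, p = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
print(jb, p)
```

A large p-value means normality is not rejected; as noted above, that is not the same as establishing normality.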


5.5  Heteroskedasticity 

Violation of assumption 2 means that the disturbances have a varying variance, i.e., $E(u_i^2) = \sigma_i^2$,
$i = 1, 2, \ldots, n$. First, we study the effect of this violation on the OLS estimators. For the simple
linear regression it is obvious that $\hat{\beta}_{OLS}$ given in equation (5.2) is still unbiased and consistent
because these properties depend upon assumptions 1 and 4, and not assumption 2. However,
the variance of $\hat{\beta}_{OLS}$ is now different:

$$\text{var}(\hat{\beta}_{OLS}) = \text{var}\left(\sum_{i=1}^{n} x_i u_i \Big/ \sum_{i=1}^{n} x_i^2\right) = \sum_{i=1}^{n} x_i^2\sigma_i^2 \Big/ \left(\sum_{i=1}^{n} x_i^2\right)^2 \tag{5.9}$$

where the second equality follows from assumption 3 and the fact that var$(u_i)$ is now $\sigma_i^2$.
Note that if $\sigma_i^2 = \sigma^2$, this reverts back to $\sigma^2/\sum_{i=1}^{n} x_i^2$, the usual formula for var$(\hat{\beta}_{OLS})$ under
homoskedasticity. Furthermore, one can show that $E(s^2)$ will involve all of the $\sigma_i^2$'s and not one
common $\sigma^2$, see problem 1. This means that the regression package reporting $s^2/\sum_{i=1}^{n} x_i^2$ as the
estimate of the variance of $\hat{\beta}_{OLS}$ is committing two errors. First, it is not using the right formula
for the variance, i.e., equation (5.9). Second, it is using $s^2$ to estimate a common $\sigma^2$ when in
fact the $\sigma_i^2$'s are different. The bias from using $s^2/\sum_{i=1}^{n} x_i^2$ as an estimate of var$(\hat{\beta}_{OLS})$ will
depend upon the nature of the heteroskedasticity and the regressor. In fact, if $\sigma_i^2$ is positively
related to $x_i^2$, one can show that $s^2/\sum_{i=1}^{n} x_i^2$ understates the true variance and hence the
t-statistic reported for $\beta = 0$ is overblown, and the confidence interval for $\beta$ is tighter than it is
supposed to be, see problem 2. This means that the t-statistic in this case is biased towards
rejecting $H_0; \beta = 0$, i.e., showing significance of the regression slope coefficient, when it may
not be significant.

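The OLS estimator of $\beta$ is linear, unbiased and consistent, but is it still BLUE? In order to
answer this question, we note that the only violation we have is that var$(u_i) = \sigma_i^2$. Hence, if
we divided $u_i$ by $\sigma_i/\sigma$, the resulting $u_i^* = \sigma u_i/\sigma_i$ will have a constant variance $\sigma^2$. It is easy to
show that $u_i^*$ satisfies all the classical assumptions including homoskedasticity. The regression
model becomes

$$\sigma Y_i/\sigma_i = \alpha\sigma/\sigma_i + \beta\sigma X_i/\sigma_i + u_i^* \tag{5.10}$$

and OLS on this model (5.10) is BLUE. The OLS normal equations on (5.10) are

$$\begin{aligned}
\sum_{i=1}^{n} (Y_i/\sigma_i^2) &= \alpha\sum_{i=1}^{n} (1/\sigma_i^2) + \beta\sum_{i=1}^{n} (X_i/\sigma_i^2) \\
\sum_{i=1}^{n} (Y_iX_i/\sigma_i^2) &= \alpha\sum_{i=1}^{n} (X_i/\sigma_i^2) + \beta\sum_{i=1}^{n} (X_i^2/\sigma_i^2)
\end{aligned} \tag{5.11}$$

Note that $\sigma^2$ drops out of these equations. Solving (5.11), see problem 3, one gets

$$\tilde{\alpha} = \left[\sum_{i=1}^{n} (Y_i/\sigma_i^2) \Big/ \sum_{i=1}^{n} (1/\sigma_i^2)\right] - \tilde{\beta}\left[\sum_{i=1}^{n} (X_i/\sigma_i^2) \Big/ \sum_{i=1}^{n} (1/\sigma_i^2)\right] = \bar{Y}^* - \tilde{\beta}\bar{X}^* \tag{5.12a}$$

with $\bar{Y}^* = \sum_{i=1}^{n} (Y_i/\sigma_i^2) / \sum_{i=1}^{n} (1/\sigma_i^2) = \sum_{i=1}^{n} w_i^* Y_i / \sum_{i=1}^{n} w_i^*$ and
$\bar{X}^* = \sum_{i=1}^{n} (X_i/\sigma_i^2) / \sum_{i=1}^{n} (1/\sigma_i^2) = \sum_{i=1}^{n} w_i^* X_i / \sum_{i=1}^{n} w_i^*$,
where $w_i^* = (1/\sigma_i^2)$. Similarly,

$$\begin{aligned}
\tilde{\beta} &= \frac{\left[\sum_{i=1}^{n} (1/\sigma_i^2)\right]\left[\sum_{i=1}^{n} (Y_iX_i/\sigma_i^2)\right] - \left[\sum_{i=1}^{n} (X_i/\sigma_i^2)\right]\left[\sum_{i=1}^{n} (Y_i/\sigma_i^2)\right]}{\left[\sum_{i=1}^{n} (X_i^2/\sigma_i^2)\right]\left[\sum_{i=1}^{n} (1/\sigma_i^2)\right] - \left[\sum_{i=1}^{n} (X_i/\sigma_i^2)\right]^2} \\
&= \frac{\left(\sum_{i=1}^{n} w_i^*\right)\left(\sum_{i=1}^{n} w_i^* X_i Y_i\right) - \left(\sum_{i=1}^{n} w_i^* X_i\right)\left(\sum_{i=1}^{n} w_i^* Y_i\right)}{\left(\sum_{i=1}^{n} w_i^*\right)\left(\sum_{i=1}^{n} w_i^* X_i^2\right) - \left(\sum_{i=1}^{n} w_i^* X_i\right)^2} \\
&= \frac{\sum_{i=1}^{n} w_i^* (X_i - \bar{X}^*)(Y_i - \bar{Y}^*)}{\sum_{i=1}^{n} w_i^* (X_i - \bar{X}^*)^2}
\end{aligned} \tag{5.12b}$$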
It is clear that the BLU estimators $\tilde{\alpha}$ and $\tilde{\beta}$, obtained from the regression in (5.10), are different
from the usual OLS estimators $\hat{\alpha}_{OLS}$ and $\hat{\beta}_{OLS}$ since they depend upon the $\sigma_i^2$'s. It is also true
that when $\sigma_i^2 = \sigma^2$ for all $i = 1, 2, \ldots, n$, i.e., under homoskedasticity, (5.12) reduces to the
usual OLS estimators given by equation (3.4) of Chapter 3. The BLU estimators weight the
i-th observation by $(1/\sigma_i)$, which is a measure of precision of that observation. The more precise
the observation, i.e., the smaller $\sigma_i$, the larger is the weight attached to that observation. $\tilde{\alpha}$
and $\tilde{\beta}$ are also known as Weighted Least Squares (WLS) estimators, which are a specific form
of Generalized Least Squares (GLS). We will study GLS in detail in Chapter 9, using matrix
notation.
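The equivalence between the closed-form expressions in (5.12) and OLS on the transformed model (5.10) can be checked numerically. The sketch below assumes NumPy, known $\sigma_i^2$'s, and simulated data with an illustrative variance pattern; the function names are hypothetical. Since the scalar $\sigma$ drops out of the normal equations, the transformation simply divides by $\sigma_i$.

```python
import numpy as np

def wls_closed_form(x, y, sigma2):
    """BLU estimates from (5.12a) and (5.12b) with weights w_i* = 1/sigma_i^2."""
    w = 1.0 / sigma2
    x_bar = np.sum(w * x) / np.sum(w)      # weighted mean of X
    y_bar = np.sum(w * y) / np.sum(w)      # weighted mean of Y
    beta = np.sum(w * (x - x_bar) * (y - y_bar)) / np.sum(w * (x - x_bar) ** 2)
    alpha = y_bar - beta * x_bar
    return float(alpha), float(beta)

def wls_by_transformation(x, y, sigma2):
    """OLS of Y_i/sigma_i on 1/sigma_i and X_i/sigma_i, without a constant."""
    s = np.sqrt(sigma2)
    Z = np.column_stack([1.0 / s, x / s])
    coef, *_ = np.linalg.lstsq(Z, y / s, rcond=None)
    return float(coef[0]), float(coef[1])

# Simulated data; the variance pattern 0.5 * X_i^2 is an assumption for illustration.
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=200)
sigma2 = 0.5 * x ** 2
y = 2.0 + 3.0 * x + rng.standard_normal(200) * np.sqrt(sigma2)

a1, b1 = wls_closed_form(x, y, sigma2)
a2, b2 = wls_by_transformation(x, y, sigma2)
print(a1, b1, a2, b2)
```

The two routes should agree up to rounding error, which is a convenient check when implementing either one.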

Under heteroskedasticity, OLS loses efficiency in that it is no longer BLUE. However,
because it is still unbiased and consistent, and because the true $\sigma_i^2$'s are never known, some
researchers compute OLS as an initial consistent estimator of the regression coefficients. It is
important to emphasize, however, that the standard errors of these estimates as reported by the
regression package are biased, and any inference based on these estimated variances, including
the reported t-statistics, is misleading. White (1980) proposed a simple procedure that would
yield heteroskedasticity-consistent standard errors of the OLS estimators. In equation (5.9), this
amounts to replacing $\sigma_i^2$ by $e_i^2$, the square of the i-th OLS residual, i.e.,

$$\text{White's var}(\hat{\beta}_{OLS}) = \sum_{i=1}^{n} x_i^2 e_i^2 \Big/ \left(\sum_{i=1}^{n} x_i^2\right)^2 \tag{5.13}$$

Note that we cannot consistently estimate $\sigma_i^2$ by $e_i^2$, since there is one observation per parameter
estimated. As the sample size increases, so does the number of unknown $\sigma_i^2$'s. What White (1980)
consistently estimates is var$(\hat{\beta}_{OLS})$, which is a weighted average of the $e_i^2$. The same analysis
applies to the multiple regression OLS estimates. In this case, White's (1980) heteroskedasticity-
consistent estimate of the variance of the k-th OLS regression coefficient $\hat{\beta}_k$ is given by

$$\text{White's var}(\hat{\beta}_k) = \sum_{i=1}^{n} \hat{\nu}_i^2 e_i^2 \Big/ \left(\sum_{i=1}^{n} \hat{\nu}_i^2\right)^2$$

where $\hat{\nu}_i^2$ is the squared OLS residual obtained from regressing $X_k$ on the remaining
regressors in the equation being estimated, and $e_i$ is the i-th OLS residual from this multiple regression
equation. Many regression packages provide White's heteroskedasticity-consistent estimates of
the variances and their corresponding robust t-statistics. For example, using EViews, one clicks
on Quick and chooses Estimate Equation. Clicking on Options brings up a menu where one selects
White to obtain the heteroskedasticity-consistent estimates of the variances.
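For the simple regression, both the conventional variance estimate and White's estimate in (5.13) take one line each. The following is a minimal sketch, assuming NumPy; the function name and the three-point data set are illustrative and chosen only so that the arithmetic can be followed by hand.

```python
import numpy as np

def slope_variances(X, Y):
    """Return (conventional, White) variance estimates of the OLS slope."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x = X - X.mean()                               # deviations from the mean
    beta = np.sum(x * Y) / np.sum(x ** 2)          # OLS slope
    alpha = Y.mean() - beta * X.mean()             # OLS intercept
    e = Y - alpha - beta * X                       # OLS residuals
    s2 = np.sum(e ** 2) / (X.size - 2)
    conventional = s2 / np.sum(x ** 2)             # s^2 / sum of x_i^2
    white = np.sum(x ** 2 * e ** 2) / np.sum(x ** 2) ** 2   # equation (5.13)
    return float(conventional), float(white)

conv, white = slope_variances([0.0, 1.0, 2.0], [0.0, 1.0, 5.0])
print(conv, white)
```

With these three points the residuals are 0.5, -1 and 0.5, so the conventional estimate is 1.5/2 and White's estimate is 0.5/4. The two can differ in either direction in a given sample; only White's estimate remains consistent for var$(\hat{\beta}_{OLS})$ under heteroskedasticity.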

While the regression packages correct for heteroskedasticity in the t-statistics, they do not
usually do that for the F-statistics studied, say, in Example 2 in Chapter 4. Wooldridge (1991)
suggests a simple way of obtaining a robust LM statistic for $H_0; \beta_2 = \beta_3 = 0$ in the multiple
regression (4.1). This involves the following steps:

(1) Run OLS on the restricted model without $X_2$ and $X_3$ and obtain the restricted least
squares residuals $\tilde{u}$.

(2) Regress each of the independent variables excluded under the null (i.e., $X_2$ and $X_3$) on all
of the other included independent variables (i.e., $X_4, X_5, \ldots, X_K$) including the constant.
Get the corresponding residuals $\tilde{v}_2$ and $\tilde{v}_3$, respectively.

(3) Regress the dependent variable equal to 1 for all observations on $\tilde{v}_2\tilde{u}$, $\tilde{v}_3\tilde{u}$ without a
constant and obtain the robust LM statistic equal to $n$ minus the sum of squared residuals of
this regression. This is exactly $nR^2$ of this last regression, where $R^2$ is the uncentered
R-squared because there is no constant. Under $H_0$ this LM statistic is distributed as $\chi_2^2$.
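The three steps translate directly into three sets of auxiliary regressions. The sketch below assumes NumPy and uses simulated data generated under the null with heteroskedastic errors; the function names and the data-generating choices are illustrative, not Wooldridge's own code.

```python
import numpy as np

def ols_residuals(Z, y):
    """Residuals from an OLS regression of y on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ coef

def robust_lm(y, X_included, X_excluded):
    """Robust LM statistic for H0: coefficients on X_excluded are zero.

    X_included must contain the column of ones; X_excluded holds the
    regressors dropped under the null (X_2 and X_3 in the text).
    """
    n = y.size
    u = ols_residuals(X_included, y)                              # step (1)
    V = np.column_stack([ols_residuals(X_included, X_excluded[:, j])
                         for j in range(X_excluded.shape[1])])    # step (2)
    products = V * u[:, None]                                     # v_j times u
    ssr = np.sum(ols_residuals(products, np.ones(n)) ** 2)        # step (3), no constant
    return float(n - ssr)

# Hypothetical data: y depends only on the included regressor x4.
rng = np.random.default_rng(42)
n = 300
x4 = rng.uniform(1.0, 5.0, n)
excluded = rng.standard_normal((n, 2)) + x4[:, None]
y = 1.0 + 0.5 * x4 + x4 * rng.standard_normal(n)   # error sd grows with x4

lm = robust_lm(y, np.column_stack([np.ones(n), x4]), excluded)
p_value = float(np.exp(-lm / 2.0))   # chi-square(2) upper tail, two restrictions
print(lm, p_value)
```

The closed-form p-value applies only when exactly two restrictions are tested; with a different number of excluded regressors a general $\chi^2$ tail probability would be needed.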

Since OLS is no longer BLUE, one should compute $\tilde{\alpha}$ and $\tilde{\beta}$. The only problem is that the $\sigma_i$'s
are rarely known. One example where the $\sigma_i$'s are known up to a scalar constant is the following
simple example of aggregation.

Example 5.1: Aggregation and Heteroskedasticity. Let $Y_{ij}$ be the observation on the j-th firm
in the i-th industry, and consider the following regression:

$$Y_{ij} = \alpha + \beta X_{ij} + u_{ij}, \qquad j = 1, 2, \ldots, n_i;\; i = 1, 2, \ldots, m \tag{5.14}$$

If only aggregate observations on each industry are available, then (5.14) is summed over firms,
i.e.,

$$Y_i = \alpha n_i + \beta X_i + u_i, \qquad i = 1, 2, \ldots, m \tag{5.15}$$

where $Y_i = \sum_{j=1}^{n_i} Y_{ij}$, $X_i = \sum_{j=1}^{n_i} X_{ij}$, $u_i = \sum_{j=1}^{n_i} u_{ij}$ for $i = 1, 2, \ldots, m$. Note that although the
$u_{ij}$'s are IID$(0, \sigma^2)$, by aggregating, we get $u_i \sim (0, n_i\sigma^2)$. This means that the disturbances in
(5.15) are heteroskedastic. However, $\sigma_i^2 = n_i\sigma^2$ is known up to a scalar constant. In fact,
$\sigma/\sigma_i$ is $1/(n_i)^{1/2}$. Therefore, premultiplying (5.15) by $1/(n_i)^{1/2}$ and performing OLS on the
transformed equation results in BLU estimators of $\alpha$ and $\beta$. In other words, BLU estimation
reduces to performing OLS of $Y_i/(n_i)^{1/2}$ on $(n_i)^{1/2}$ and $X_i/(n_i)^{1/2}$, without an intercept.

There may be other special cases in practice where $\sigma_i$ is known up to a scalar, but in general,
$\sigma_i$ is usually unknown and will have to be estimated. This is hopeless with only $n$ observations,
since there are $n$ $\sigma_i$'s, so we either have to have repeated observations, or know more about the
$\sigma_i$'s. Let us discuss these two cases.

Case  1:  Repeated  Observations 

Suppose that $n_i$ households are selected randomly with income $X_i$ for $i = 1, 2, \ldots, m$. For
each household $j = 1, 2, \ldots, n_i$, we observe its consumption expenditures on food, say $Y_{ij}$. The
regression equation is

$$Y_{ij} = \alpha + \beta X_i + u_{ij}, \qquad i = 1, 2, \ldots, m;\; j = 1, 2, \ldots, n_i \tag{5.16}$$

where $m$ is the number of income groups selected. Note that $X_i$ has only one subscript, whereas
$Y_{ij}$ has two subscripts denoting the repeated observations on households with the same income
$X_i$. The $u_{ij}$'s are independently distributed $(0, \sigma_i^2)$, reflecting the heteroskedasticity in
consumption expenditures among the different income groups. In this case, there are $n = \sum_{i=1}^{m} n_i$
observations and $m$ $\sigma_i^2$'s to be estimated. This is feasible, and there are two methods for estimating
these $\sigma_i^2$'s. The first is to compute

$$\hat{s}_i^2 = \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2/(n_i - 1)$$

where $\bar{Y}_i = \sum_{j=1}^{n_i} Y_{ij}/n_i$. The second is to compute $\tilde{s}_i^2 = \sum_{j=1}^{n_i} e_{ij}^2/n_i$, where $e_{ij}$ is the OLS
residual given by

$$e_{ij} = Y_{ij} - \hat{\alpha}_{OLS} - \hat{\beta}_{OLS}X_i$$

Both estimators of $\sigma_i^2$ are consistent. Substituting either $\hat{s}_i^2$ or $\tilde{s}_i^2$ for $\sigma_i^2$ in (5.12) will result
in feasible estimators of $\alpha$ and $\beta$. However, the resulting estimates are no longer BLUE. The
substitution of the consistent estimators of $\sigma_i^2$ is justified on the basis that the resulting $\alpha$
and $\beta$ estimates will be asymptotically efficient, see Chapter 9. Of course, this step could have
been replaced by a regression of $Y_{ij}/\hat{s}_i$ on $(1/\hat{s}_i)$ and $(X_i/\hat{s}_i)$ without a constant, or the similar
regression in terms of $\tilde{s}_i$. For this latter estimate, $\tilde{s}_i^2$, one can iterate, i.e., obtain new residuals
based on the new regression estimates and therefore new $\tilde{s}_i^2$. The process continues until the
estimates obtained from the r-th iteration do not differ from those of the (r + 1)-th iteration
in absolute value by more than a small arbitrary positive number chosen as the convergence
criterion. Once the estimates converge, the final round estimators are the maximum likelihood
estimators, see Oberhofer and Kmenta (1974).


Case  2:  Assuming  More  Information  on  the  Form  of  Heteroskedasticity 

If we do not have repeated observations, it is hopeless to try to estimate $n$ variances and $\alpha$
and $\beta$ with only $n$ observations. More structure on the form of heteroskedasticity is needed to
estimate this model, but not necessarily to test it. Heteroskedasticity is more likely to occur
with cross-section data where the observations may be on firms of different size. For example,
a regression relating profits to sales might have heteroskedasticity, because larger firms have
more resources to draw upon, can borrow more, invest more, and lose or gain more than
smaller firms. Therefore, we expect the form of heteroskedasticity to be related to the size of
the firm, which is reflected in this case by the regressor, sales, or some other variable that
measures size, like assets. Hence, for this regression we can write $\sigma_i^2 = \sigma^2Z_i^2$, where $Z_i$ denotes
the sales or assets of firm $i$. Once again the form of heteroskedasticity is known up to a scalar
constant and the BLU estimators of $\alpha$ and $\beta$ can be obtained from (5.12), assuming $Z_i$ is known.
Alternatively, one can run the regression of $Y_i/Z_i$ on $1/Z_i$ and $X_i/Z_i$ without a constant to get
the same result. Special cases of $Z_i$ are $X_i$ and $E(Y_i)$.

(i) If $Z_i = X_i$, the regression becomes that of $Y_i/X_i$ on $1/X_i$ and a constant. Note that the
regression coefficient of $1/X_i$ is the estimate of $\alpha$, while the constant of the regression is now
the estimate of $\beta$. But is it possible to have $u_i$ uncorrelated with $X_i$ when we are assuming
var$(u_i)$ related to $X_i$? The answer is yes, as long as $E(u_i \mid X_i) = 0$, i.e., the mean of $u_i$ is zero
for every value of $X_i$, see Figure 3.4 of Chapter 3. This, in turn, implies that the overall mean
of the $u_i$'s is zero, i.e., $E(u_i) = 0$, and that cov$(X_i, u_i) = 0$. If the latter is not satisfied and,
say, cov$(X_i, u_i)$ is positive, then large values of $X_i$ imply large values of $u_i$. This would mean
that for these values of $X_i$, we have a non-zero mean for the corresponding $u_i$'s. This contradicts
$E(u_i \mid X_i) = 0$. Hence, if $E(u_i \mid X_i) = 0$, then cov$(X_i, u_i) = 0$.

(ii) If $Z_i = E(Y_i) = \alpha + \beta X_i$, then $\sigma_i$ is proportional to the population regression line,
which is a linear function of $\alpha$ and $\beta$. Since the OLS estimates are consistent, one can estimate
$E(Y_i)$ by $\hat{Y}_i = \hat{\alpha}_{OLS} + \hat{\beta}_{OLS}X_i$ and use $Z_i = \hat{Y}_i$ instead of $E(Y_i)$. In other words,
run the regression of $Y_i/\hat{Y}_i$ on $1/\hat{Y}_i$ and $X_i/\hat{Y}_i$ without a constant. The resulting estimates are
asymptotically efficient, see Amemiya (1973).

One can generalize $\sigma_i^2 = \sigma^2Z_i^2$ to $\sigma_i^2 = \sigma^2Z_i^\delta$, where $\delta$ is an unknown parameter to be
estimated. Hence, rather than estimating $n$ $\sigma_i^2$'s, one has to estimate only $\sigma^2$ and $\delta$. Assuming
normality, one can set up the likelihood function and derive the first-order conditions by
differentiating that likelihood with respect to $\alpha$, $\beta$, $\sigma^2$ and $\delta$. The resulting equations are highly
nonlinear. Alternatively, one can search over possible values for $\delta = 0, 0.1, 0.2, \ldots, 4$, and get the
corresponding estimates of $\alpha$, $\beta$, and $\sigma^2$ from the regression of $Y_i/Z_i^{\delta/2}$ on $1/Z_i^{\delta/2}$ and $X_i/Z_i^{\delta/2}$
without a constant. This is done for every $\delta$ and the value of the likelihood function is reported.
Using this search procedure one can get the maximum value of the likelihood and, corresponding
to it, the MLE of $\alpha$, $\beta$, $\sigma^2$ and $\delta$. Note that as $\delta$ increases so does the degree of heteroskedasticity.
Problem 4 asks the reader to compute the relative efficiency of the OLS estimator with respect
to the BLU estimator for $Z_i = X_i$ for various values of $\delta$. As expected, the relative efficiency of
the OLS estimator declines as the degree of heteroskedasticity increases.
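The search over $\delta$ can be sketched as follows. The code assumes NumPy, normal disturbances, strictly positive $Z_i$, and simulated data with $Z_i = X_i$ and a true $\delta$ of 2; these choices and the function names are illustrative. For a given $\delta$, the likelihood is concentrated with respect to $\alpha$, $\beta$ and $\sigma^2$, which leaves $-\tfrac{n}{2}[\log 2\pi + 1 + \log\hat{\sigma}^2] - \tfrac{\delta}{2}\sum_i \log Z_i$.

```python
import numpy as np

def profile_loglik(delta, x, y, z):
    """Concentrated normal log-likelihood for a given delta, plus the estimates."""
    n = y.size
    s = z ** (delta / 2.0)                          # sigma_i up to the scalar sigma
    W = np.column_stack([1.0 / s, x / s])           # transformed model, no constant
    coef, *_ = np.linalg.lstsq(W, y / s, rcond=None)
    resid = y / s - W @ coef
    sigma2 = np.sum(resid ** 2) / n                 # MLE of sigma^2 given delta
    loglik = (-0.5 * n * (np.log(2.0 * np.pi) + 1.0 + np.log(sigma2))
              - 0.5 * delta * np.sum(np.log(z)))
    return loglik, coef, sigma2

def search_delta(x, y, z, grid=None):
    """Return the grid value of delta with the largest log-likelihood."""
    if grid is None:
        grid = np.linspace(0.0, 4.0, 41)            # 0, 0.1, ..., 4
    values = [profile_loglik(d, x, y, z)[0] for d in grid]
    return float(grid[int(np.argmax(values))])

# Hypothetical data: error standard deviation proportional to X_i, so delta = 2.
rng = np.random.default_rng(7)
x = rng.uniform(1.0, 10.0, 2000)
y = 1.0 + 2.0 * x + x * rng.standard_normal(2000)

delta_hat = search_delta(x, y, x)
print(delta_hat)
```

A finer grid around the maximizing value can be used for a second pass if more precision in $\delta$ is wanted.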

One can also generalize $\sigma_i^2 = \sigma^2Z_i^\delta$ to include more $Z$ variables. In fact, a general form of
this multiplicative heteroskedasticity is

$$\log\sigma_i^2 = \log\sigma^2 + \delta_1\log Z_{1i} + \delta_2\log Z_{2i} + \ldots + \delta_r\log Z_{ri} \tag{5.17}$$

with $r < n$, otherwise one cannot estimate with $n$ observations. $Z_1, Z_2, \ldots, Z_r$ are known
variables determining the heteroskedasticity. Note that if $\delta_2 = \delta_3 = \ldots = \delta_r = 0$, we revert
back to $\sigma_i^2 = \sigma^2Z_i^\delta$, where $\delta = \delta_1$. For the estimation of this general multiplicative form of
heteroskedasticity, see Harvey (1976).

Another form for heteroskedasticity is the additive form

$$\sigma_i^2 = a + b_1Z_{1i} + b_2Z_{2i} + \ldots + b_rZ_{ri} \tag{5.18}$$

where $r < n$, see Goldfeld and Quandt (1972). Special cases of (5.18) include

$$\sigma_i^2 = a + b_1X_i + b_2X_i^2 \tag{5.19}$$

where if $a$ and $b_1$ are zero we have a simple form of multiplicative heteroskedasticity. In order
to estimate the regression model with additive heteroskedasticity of the type given in (5.19),
one can get the OLS residuals, the $e_i$'s, and run the following regression

$$e_i^2 = a + b_1X_i + b_2X_i^2 + v_i \tag{5.20}$$

where $v_i = e_i^2 - \sigma_i^2$. The $v_i$'s are heteroskedastic, and the OLS estimates of (5.20) yield the
following estimates of $\sigma_i^2$

$$\hat{\sigma}_i^2 = \hat{a}_{OLS} + \hat{b}_{1,OLS}X_i + \hat{b}_{2,OLS}X_i^2 \tag{5.21}$$

One can obtain a better estimate of the $\sigma_i^2$'s by computing the following regression, which
corrects for the heteroskedasticity in the $v_i$'s

$$(e_i^2/\hat{\sigma}_i) = a(1/\hat{\sigma}_i) + b_1(X_i/\hat{\sigma}_i) + b_2(X_i^2/\hat{\sigma}_i) + w_i \tag{5.22}$$

The new estimates of $\sigma_i^2$ are

$$\tilde{\sigma}_i^2 = \tilde{a} + \tilde{b}_1X_i + \tilde{b}_2X_i^2 \tag{5.23}$$

where $\tilde{a}$, $\tilde{b}_1$ and $\tilde{b}_2$ are the OLS estimates from (5.22). Using the $\tilde{\sigma}_i^2$'s one can run the regression of
$Y_i/\tilde{\sigma}_i$ on $(1/\tilde{\sigma}_i)$ and $X_i/\tilde{\sigma}_i$ without a constant to get asymptotically efficient estimates of $\alpha$ and
$\beta$. These have the same asymptotic properties as the MLE estimators derived in Rutemiller and
Bowers (1968), see Amemiya (1977) and Buse (1984). The problem with this iterative procedure
is that there is no guarantee that the $\tilde{\sigma}_i^2$'s are positive, which means that the square root $\tilde{\sigma}_i$
may not exist. This problem would not occur if $\sigma_i^2 = (a + b_1X_i + b_2X_i^2)^2$, because in this case
one regresses $|e_i|$ on a constant, $X_i$ and $X_i^2$, and the predicted value from this regression would
be an estimate of $\sigma_i$. It would not matter if this predictor is negative, because we do not have
to take its square root and because its sign cancels in the OLS normal equations of the final
regression of $Y_i/\hat{\sigma}_i$ on $(1/\hat{\sigma}_i)$ and $(X_i/\hat{\sigma}_i)$ without a constant.
Testing for Homoskedasticity

In the repeated observations case, one can perform Bartlett's (1937) test. The null hypothesis
is $H_0; \sigma_1^2 = \sigma_2^2 = \ldots = \sigma_m^2$. Under the null there is one variance $\sigma^2$, which can be estimated
by the pooled variance $s^2 = \sum_{i=1}^{m} \nu_i\hat{s}_i^2/\nu$, where $\nu = \sum_{i=1}^{m} \nu_i$ and $\nu_i = n_i - 1$. Under the
alternative hypothesis there are $m$ different variances estimated by $\hat{s}_i^2$ for $i = 1, 2, \ldots, m$. The
Likelihood Ratio test, which computes the ratio of the likelihoods under the null and alternative
hypotheses, reduces to computing

$$B = \left[\nu\log s^2 - \sum_{i=1}^{m} \nu_i\log\hat{s}_i^2\right] \Big/ c \tag{5.24}$$

where $c = 1 + \left[\sum_{i=1}^{m} (1/\nu_i) - 1/\nu\right]/3(m - 1)$. Under $H_0$, $B$ is distributed $\chi_{m-1}^2$. Hence, a large
p-value for the B-statistic given in (5.24) means that we do not reject homoskedasticity, whereas
a small p-value leads to rejection of $H_0$ in favor of heteroskedasticity.
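The statistic in (5.24) can be computed from the group data alone. The following is a minimal sketch, assuming NumPy and that each group holds the repeated observations $Y_{ij}$ for one value of $X_i$, with at least two observations per group; the function name and the toy groups are illustrative.

```python
import numpy as np

def bartlett_statistic(groups):
    """B of (5.24) for a list of 1-D arrays, one per group."""
    m = len(groups)
    nu_i = np.array([len(g) - 1 for g in groups], dtype=float)   # n_i - 1
    s2_i = np.array([np.var(g, ddof=1) for g in groups])         # group variances
    nu = nu_i.sum()
    s2 = np.sum(nu_i * s2_i) / nu                                # pooled variance
    c = 1.0 + (np.sum(1.0 / nu_i) - 1.0 / nu) / (3.0 * (m - 1))
    return float((nu * np.log(s2) - np.sum(nu_i * np.log(s2_i))) / c)

# Equal group variances give B = 0; unequal variances give B > 0.
b_equal = bartlett_statistic([np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])])
b_unequal = bartlett_statistic([np.array([1.0, 2.0, 3.0]), np.array([0.0, 10.0, 20.0])])
print(b_equal, b_unequal)
```

The resulting value is compared with the critical value of a $\chi^2$ distribution with $m - 1$ degrees of freedom.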

In  case  of  no  repeated  observations,  several  tests  exist  in  the  literature.  Among  these  are  the 
following: 

(1) Glejser's (1969) Test: In this case one regresses $|e_i|$ on a constant and $Z_i^\delta$ for $\delta =
1, -1, 0.5$ and $-0.5$. If the coefficient of $Z_i^\delta$ is significantly different from zero, this would lead
to a rejection of homoskedasticity. The power of this test depends upon the true form of
heteroskedasticity. One important result, however, is that this power is not seriously impaired if
the wrong value of $\delta$ is chosen, see Ali and Giaccotto (1984), who confirmed this result using
extensive Monte Carlo experiments.

(2) The Goldfeld and Quandt (1965) Test: This is a simple and intuitive test. One orders
the observations according to $X_i$ and omits $c$ central observations. Next, two regressions are run
on the two separated sets of observations with $(n - c)/2$ observations in each. The $c$ omitted
observations separate the low value $X$'s from the high value $X$'s, and if heteroskedasticity
exists and is related to $X_i$, the estimates of $\sigma^2$ reported from the two regressions should be
different. Hence, the test statistic is $s_2^2/s_1^2$, where $s_1^2$ and $s_2^2$ are the Mean Square Error of the
two regressions, respectively. Their ratio would be the same as that of the two residual sums of
squares because the degrees of freedom of the two regressions are the same. This statistic is
F-distributed with $((n - c)/2) - K$ degrees of freedom in the numerator as well as the denominator.
The only remaining question for performing this test is the magnitude of $c$. Obviously, the larger
$c$ is, the more central observations are being omitted and the more confident we feel that the
two samples are distant from each other. The loss of $c$ observations should lead to loss of power.
However, separating the two samples should give us more confidence that the two variances
are in fact the same if we do not reject homoskedasticity. This trade-off in power was studied
by Goldfeld and Quandt using Monte Carlo experiments. Their results recommend the use of
$c = 8$ for $n = 30$ and $c = 16$ for $n = 60$. This is a popular test, but it assumes that we know
how to order the heteroskedasticity, in this case using $X_i$. But what if there is more than one
regressor on the right-hand side? In that case one can order the observations using $\hat{Y}_i$.

(3) Spearman's Rank Correlation Test: This test ranks the $X_i$'s and the absolute value of
the OLS residuals, the $e_i$'s. Then it computes the difference between these rankings, i.e., $d_i =
\text{rank}(|e_i|) - \text{rank}(X_i)$. The Spearman correlation coefficient is $r = 1 - \left[6\sum_{i=1}^{n} d_i^2/(n^3 - n)\right]$.
Finally, test $H_0$; the correlation coefficient between the rankings is zero, by computing $t =
[r^2(n - 2)/(1 - r^2)]^{1/2}$, which is t-distributed with $(n - 2)$ degrees of freedom. If this t-statistic
has a large p-value we do not reject homoskedasticity. Otherwise, we reject homoskedasticity in
favor of heteroskedasticity.

(4) Harvey's (1976) Multiplicative Heteroskedasticity Test: If heteroskedasticity is
related to $X_i$, it looks like the Goldfeld and Quandt test or the Spearman rank correlation test
would detect it, and the Glejser test would establish its form. In case the form of
heteroskedasticity is of the multiplicative type, Harvey (1976) suggests the following test, which rewrites
(5.17) as

$$\log e_i^2 = \log\sigma^2 + \delta_1\log Z_{1i} + \ldots + \delta_r\log Z_{ri} + v_i \tag{5.25}$$

where $v_i = \log(e_i^2/\sigma_i^2)$. This disturbance term has an asymptotic distribution that is $\log\chi_1^2$. This
random variable has mean $-1.2704$ and variance $4.9348$. Therefore, Harvey suggests performing
the regression in (5.25) and testing $H_0; \delta_1 = \delta_2 = \ldots = \delta_r = 0$ by computing the regression
sum of squares divided by $4.9348$. This statistic is distributed asymptotically as $\chi_r^2$. This is also
asymptotically equivalent to an F-test that tests for $\delta_1 = \delta_2 = \ldots = \delta_r = 0$ in the regression
given in (5.25). See the F-test described in example 6 of Chapter 4.

(5) Breusch and Pagan (1979) Test: If one knows that $\sigma_i^2 = f(a + b_1Z_1 + b_2Z_2 + \ldots + b_rZ_r)$
but does not know the form of this function $f$, Breusch and Pagan (1979) suggest the following
test for homoskedasticity, i.e., $H_0; b_1 = b_2 = \ldots = b_r = 0$. Compute $\hat{\sigma}^2 = \sum_{i=1}^{n} e_i^2/n$, which
would be the MLE estimator of $\sigma^2$ under homoskedasticity. Run the regression of $e_i^2/\hat{\sigma}^2$ on the
$Z$ variables and a constant, and compute half the regression sum of squares. This statistic is
distributed as $\chi_r^2$. This is a more general test than the ones discussed earlier in that $f$ does not
have to be specified.

(6) White’s (1980) Test: Another general test for homoskedasticity, where nothing is known about the form of the heteroskedasticity, is suggested by White (1980). This test is based on the difference between the variance of the OLS estimates under homoskedasticity and that under heteroskedasticity. For the case of a simple regression with a constant, White shows that this test compares White’s $\text{var}(\hat\beta_{OLS})$ given by (5.13) with the usual $\text{var}(\hat\beta_{OLS}) = s^2/\sum_{i=1}^n x_i^2$ under homoskedasticity. This test reduces to running the regression of $e_i^2$ on a constant, $X_i$ and $X_i^2$, and computing $nR^2$. This statistic is distributed as $\chi_2^2$ under the null hypothesis of homoskedasticity. The degrees of freedom correspond to the number of regressors without the constant. If this statistic is not significant, then $e_i^2$ is not related to $X_i$ and $X_i^2$ and we cannot reject that the variance is constant. Note that if there is no constant in the regression, we run $e_i^2$ on a constant and $X_i^2$ only, i.e., $X_i$ is no longer in this regression and the test has 1 degree of freedom. In general, White’s test is based on running $e_i^2$ on the cross-products of all the $X$’s in the regression being estimated, computing $nR^2$, and comparing it to the critical value of $\chi_r^2$, where $r$ is the number of regressors in this last regression excluding the constant. For the case of two regressors, $X_2$ and $X_3$, and a constant, White’s test is again based on $nR^2$ for the regression of $e^2$ on a constant, $X_2$, $X_3$, $X_2^2$, $X_2X_3$ and $X_3^2$. This statistic is distributed as $\chi_5^2$. White’s test is standard in EViews. After running the regression, click on residual tests and then choose White. This software gives the user a choice between including or excluding the cross-product terms like $X_2X_3$ from the regression. This may be useful when there are many regressors.
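For the simple regression case, White’s $nR^2$ statistic can be sketched as follows. The helper `solve` is a small hand-written linear solver used only to keep the example free of external libraries, and all names and data are illustrative assumptions.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def white_statistic(resid, x):
    """n*R^2 from regressing e^2 on a constant, X and X^2 (2 degrees of freedom)."""
    n = len(resid)
    y = [e ** 2 for e in resid]
    X = [[1.0, xi, xi ** 2] for xi in x]
    k = 3
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    b = solve(XtX, Xty)
    fitted = [sum(bi * ri for bi, ri in zip(b, r)) for r in X]
    ybar = sum(y) / n
    tss = sum((yi - ybar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return n * (1.0 - rss / tss)

# Hypothetical residuals and regressor values (illustration only)
print(white_statistic([0.5, -1.2, 0.3, 2.0, -0.7, 1.1], [1, 2, 3, 4, 5, 6]))
```

With several regressors one would add all squares and cross-products as columns of `X`; the degrees of freedom are then the number of columns excluding the constant.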

A modified Breusch-Pagan test was suggested by Koenker (1981) and Koenker and Bassett (1982). This attempts to improve the power of the Breusch-Pagan test, and make it more robust to the non-normality of the disturbances. This amounts to multiplying the Breusch-Pagan statistic (half the regression sum of squares) by $2\hat\sigma^4$, and dividing it by the second sample moment of the squared residuals about their mean, i.e., $\sum_{i=1}^n (e_i^2 - \hat\sigma^2)^2/n$, where $\hat\sigma^2 = \sum_{i=1}^n e_i^2/n$. Waldman (1983) showed that if the $Z$’s in the Breusch-Pagan test are in fact the $X$’s and their cross-products, as in White’s test, then this modified Breusch-Pagan test is exactly the $nR^2$ statistic proposed by White.
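The equivalence noted by Waldman can be checked numerically. The sketch below, for a single $Z$ variable and with hypothetical data, computes the Breusch-Pagan statistic, applies Koenker’s rescaling, and compares the result with $nR^2$ from regressing $e_i^2$ on a constant and $Z$.

```python
def reg_and_total_ss(x, y):
    """Regression and total sums of squares from OLS of y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    tss = sum((yi - ybar) ** 2 for yi in y)
    return sxy ** 2 / sxx, tss

def bp_koenker_nr2(resid, z):
    """Return (Breusch-Pagan, Koenker-modified, n*R^2) for one Z variable."""
    n = len(resid)
    e2 = [e ** 2 for e in resid]
    sigma2 = sum(e2) / n
    reg_ss_g, _ = reg_and_total_ss(z, [v / sigma2 for v in e2])
    bp = reg_ss_g / 2.0
    second_moment = sum((v - sigma2) ** 2 for v in e2) / n
    koenker = bp * 2.0 * sigma2 ** 2 / second_moment
    reg_ss_e2, tss_e2 = reg_and_total_ss(z, e2)
    return bp, koenker, n * reg_ss_e2 / tss_e2

# Hypothetical residuals and Z values (illustration only)
bp, koenker, n_r2 = bp_koenker_nr2([0.5, -1.2, 0.3, 2.0, -0.7, 1.1],
                                   [1, 2, 3, 4, 5, 6])
print(bp, koenker, n_r2)
```

The last two numbers agree up to rounding error, because the rescaling replaces the normal-theory variance $2\hat\sigma^4$ of $e_i^2$ by its sample variance.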

White’s (1980) test for heteroskedasticity without specifying its form led to further work on estimators that are more efficient than OLS while recognizing that the efficiency of GLS may not be achievable, see Cragg (1992). Adaptive estimators have been developed by Carroll (1982) and Robinson (1987). These estimators assume no particular form of heteroskedasticity but nevertheless have the same asymptotic distribution as GLS based on the true $\sigma_i^2$.

Many Monte Carlo experiments were performed to study the performance of these and other tests of homoskedasticity. One such extensive study is that of Ali and Giaccotto (1984). Six types of heteroskedasticity specifications were considered:

(i) $\sigma_i^2 = \sigma^2$    (ii) $\sigma_i^2 = \sigma^2|X_i|$    (iii) $\sigma_i^2 = \sigma^2|E(Y_i)|$
(iv) $\sigma_i^2 = \sigma^2X_i^2$    (v) $\sigma_i^2 = \sigma^2[E(Y_i)]^2$    (vi) $\sigma_i^2 = \sigma^2$ for $i \le n/2$ and $\sigma_i^2 = 2\sigma^2$ for $i > n/2$

Six data sets were considered; the first three were stationary and the last three were nonstationary (stationary versus nonstationary regressors are discussed in Chapter 14). Five models were entertained, starting with a model with one regressor and no intercept and finishing with a model with an intercept and 5 variables. Four types of distributions were imposed on the disturbances. These were normal, t, Cauchy and log normal. The first three are symmetric, but the last one is skewed. Three sample sizes were considered, n = 10, 25, 40. Various correlations between the disturbances were also entertained. Among the tests considered were tests 1, 2, 5 and 6 discussed in this section. The results are too numerous to summarize, but some of the major findings are the following: (1) The power of these tests increased with the sample size and with the trending nature or the variability of the regressors. It decreased with more regressors and for deviations from the normal distribution. The results were mostly erratic when the errors were autocorrelated. (2) There were ten distributionally robust tests using OLS residuals, named TROB, which were variants of Glejser’s, White’s and Bickel’s tests, the last one being a nonparametric test not considered in this chapter. These tests were robust to both long-tailed and skewed distributions. (3) None of these tests has any significant power to detect heteroskedasticity which deviates substantially from the true underlying heteroskedasticity. For example, none of these tests was powerful in detecting heteroskedasticity of the sixth kind, i.e., $\sigma_i^2 = \sigma^2$ for $i \le n/2$ and $\sigma_i^2 = 2\sigma^2$ for $i > n/2$. In fact, the maximum power was 9%. (4) Ali and Giaccotto (1984) recommend any of the TROB tests for practical use. They note that the similarity among these tests is the use of squared residuals rather than the absolute value of the residuals. In fact, they argue that tests of the same form that use absolute value rather than squared residuals are likely to be non-robust and lack power.

Empirical Example: For the Cigarette Consumption Data given in Table 3.2, the OLS regression yields:

    log C = 4.30 - 1.34 log P + 0.17 log Y        $\bar{R}^2 = 0.27$
           (0.91)  (0.32)        (0.20)

Figure 5.1 Plots of Residuals Versus Log Y


Suspecting heteroskedasticity, we plotted the residuals from this regression versus $\log Y$ in Figure 5.1. This figure shows the dispersion of the residuals decreasing with increasing $\log Y$. Next, we performed several tests for heteroskedasticity studied in this chapter. The first is Glejser’s (1969) test. We ran the following regressions:

    |e_i| =  1.16 - 0.22 log Y
            (0.46) (0.10)

    |e_i| = -0.95 + 5.13 (log Y)^(-1)
            (0.47) (2.23)

    |e_i| = -2.00 + 4.65 (log Y)^(-0.5)
            (0.93) (2.04)

    |e_i| =  2.21 - 0.96 (log Y)^(0.5)
            (0.93) (0.42)

The t-statistics on the slope coefficient in these regressions are -2.24, 2.30, 2.29 and -2.26, respectively. All are significant with p-values of 0.03, 0.026, 0.027 and 0.029, respectively, indicating the rejection of homoskedasticity.

The second test is the Goldfeld and Quandt (1965) test. The observations are ordered according to $\log Y$ and $c = 12$ central observations are omitted. Two regressions are run on the first and last 17 observations. The first regression yields $s_1^2 = 0.04881$ and the second regression yields $s_2^2 = 0.01554$. This is a test of equality of variances and it is based on the ratio of two $\chi^2$ random variables with $17 - 3 = 14$ degrees of freedom. In fact, $s_1^2/s_2^2 = 0.04881/0.01554 = 3.141 \sim F_{14,14}$ under $H_0$. This has a p-value of 0.02 and rejects $H_0$ at the 5% level. The third test is the Spearman rank correlation test. First one obtains rank($\log Y_i$) and rank($|e_i|$) and computes $d_i = \text{rank}|e_i| - \text{rank}(\log Y_i)$. From these, $r = 1 - [6\sum_{i=1}^{46} d_i^2/(n^3 - n)] = -0.282$ and $t = [r^2(n-2)/(1-r^2)]^{1/2} = 1.948$. This is distributed as a t with 44 degrees of freedom. This t-statistic has a p-value of 0.058.
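The arithmetic behind the last two tests is easy to reproduce. The sketch below implements the Spearman rank correlation and its t-statistic, under the simplifying assumption of no ties, and recomputes the Goldfeld-Quandt ratio from the two reported variances. The function names are illustrative.

```python
import math

def ranks(values):
    """Rank from 1 (smallest) to n; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman_r(x, abs_resid):
    """Spearman rank correlation between |e_i| and x_i."""
    n = len(x)
    d = [ra - rb for ra, rb in zip(ranks(abs_resid), ranks(x))]
    return 1.0 - 6.0 * sum(di ** 2 for di in d) / (n ** 3 - n)

def spearman_t(r, n):
    """t-statistic with n - 2 degrees of freedom implied by rank correlation r."""
    return math.sqrt(r ** 2 * (n - 2) / (1.0 - r ** 2))

print(round(0.04881 / 0.01554, 3))        # Goldfeld-Quandt variance ratio
print(round(spearman_t(-0.282, 46), 2))   # t implied by r = -0.282 and n = 46
```

Because $r$ is reported to three decimals only, the recomputed t-statistic can differ from 1.948 in the third decimal.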

The fourth test is Harvey’s (1976) multiplicative heteroskedasticity test, which is based upon regressing $\log e_i^2$ on $\log(\log Y_i)$:

    log e_i^2 = 24.85 - 19.08 log(log Y)
               (17.25)  (11.03)

Harvey’s (1976) statistic divides the regression sum of squares, which is 14.360, by 4.9348. This yields 2.91, which is asymptotically distributed as $\chi_1^2$ under the null. This has a p-value of 0.088 and does not reject the null of homoskedasticity at the 5% significance level.

The fifth test is the Breusch and Pagan (1979) test, which is based on the regression of $e_i^2/\hat\sigma^2$ (where $\hat\sigma^2 = \sum_{i=1}^{46} e_i^2/46 = 0.024968$) on $\log Y_i$. The test statistic is half the regression sum of squares $= (10.971 \div 2) = 5.485$. This is distributed as $\chi_1^2$ under the null hypothesis. This has a p-value of 0.019 and rejects the null of homoskedasticity. This can be generated with Stata after running the OLS regression reported above, and then issuing the command estat hettest lnrdi:

    . estat hettest lnrdi

    Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
             Ho: Constant variance
             Variables: lnrdi

             chi2(1)      =     5.49
             Prob > chi2  =   0.0192

Finally, White’s (1980) test for heteroskedasticity is performed, which is based on the regression of $e^2$ on $\log P$, $\log Y$, $(\log P)^2$, $(\log Y)^2$, $(\log P)(\log Y)$ and a constant. This is shown in Table 5.1 using EViews. The test statistic is $nR^2 = (46)(0.3404) = 15.66$, which is distributed as $\chi_5^2$. This has a p-value of 0.008 and rejects the null of homoskedasticity. Except for Harvey’s test, all the tests performed indicate the presence of heteroskedasticity. This is true despite the fact that the data are in logs, and both consumption and income are expressed in per capita terms. White’s heteroskedasticity-consistent estimates of the standard errors are as follows:

    log C = 4.30 - 1.34 log P + 0.17 log Y
           (1.10)  (0.34)        (0.24)

These are given in Table 5.2 using EViews. Note that in this case all of the heteroskedasticity-consistent standard errors are larger than those reported using a standard OLS package, but this is not necessarily true for other data sets.

In section 5.4, we described the Jarque and Bera (1987) test for normality of the disturbances. For this cigarette consumption regression, Figure 5.2 gives the histogram of the residuals along with descriptive statistics of these residuals including their mean, median, skewness and kurtosis. This is done using EViews. The measure of skewness $S$ is estimated to be $-0.184$ and the measure of kurtosis $\kappa$ is estimated to be 2.875, yielding a Jarque-Bera statistic of

$$JB = 46\left[\frac{(-0.184)^2}{6} + \frac{(2.875 - 3)^2}{24}\right] = 0.29.$$

Table 5.1 White Heteroskedasticity Test


F-statistic          4.127779     Probability     0.004073
Obs*R-squared        15.65644     Probability     0.007897

Test Equation:
Dependent Variable: RESID^2
Method: Least Squares
Sample: 1 46
Included observations: 46

Variable      Coefficient     Std. Error     t-Statistic     Prob.
C               18.22199       5.374060        3.390730      0.0016
LNP              9.506059      3.302570        2.878382      0.0064
LNP^2            1.281141      0.656208        1.952340      0.0579
LNP*LNY         -2.078635      0.727523       -2.857139      0.0068
LNY             -7.893179      2.329386       -3.388523      0.0016
LNY^2            0.855726      0.253048        3.381670      0.0016

R-squared              0.340357     Mean dependent var       0.024968
Adjusted R-squared     0.257902     S.D. dependent var       0.034567
S.E. of regression     0.029778     Akaike info criterion   -4.068982
Sum squared resid      0.035469     Schwarz criterion       -3.830464
Log likelihood         99.58660     F-statistic              4.127779
Durbin-Watson stat     1.853360     Prob (F-statistic)       0.004073

Table 5.2 White Heteroskedasticity-Consistent Standard Errors

Dependent Variable: LNC
Method: Least Squares
Sample: 1 46
Included observations: 46
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable      Coefficient     Std. Error     t-Statistic     Prob.
C                4.299662      1.095226        3.925821      0.0003
LNP             -1.338335      0.343368       -3.897671      0.0003
LNY              0.172386      0.236610        0.728565      0.4702

R-squared              0.303714     Mean dependent var       4.847844
Adjusted R-squared     0.271328     S.D. dependent var       0.191458
S.E. of regression     0.163433     Akaike info criterion   -0.721834
Sum squared resid      1.148545     Schwarz criterion       -0.602575
Log likelihood         19.60218     F-statistic              9.378101
Durbin-Watson stat     2.315716     Prob (F-statistic)       0.000417

Series: Residuals
Sample 1 46
Observations 46

Mean           -9.90E-16
Median          0.007568
Maximum         0.328677
Minimum        -0.418675
Std. Dev.       0.159760
Skewness       -0.183945
Kurtosis        2.875020

Jarque-Bera     0.289346
Probability     0.865305

Figure 5.2 Normality Test (Jarque-Bera)


The Jarque-Bera statistic is distributed as $\chi_2^2$ under the null hypothesis of normality and has a p-value of 0.865. Hence we do not reject that the distribution of the disturbances is symmetric and has a kurtosis of 3.
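The Jarque-Bera computation can be reproduced from the unrounded skewness and kurtosis reported in Figure 5.2. The short sketch below uses the fact that the upper-tail probability of a $\chi_2^2$ variate at $x$ is $e^{-x/2}$; the function name is an illustrative choice.

```python
import math

def jarque_bera(n, skewness, kurtosis):
    """Jarque-Bera statistic and its chi-square(2) p-value."""
    jb = n * (skewness ** 2 / 6.0 + (kurtosis - 3.0) ** 2 / 24.0)
    p_value = math.exp(-jb / 2.0)   # chi-square(2) upper tail is exp(-x/2)
    return jb, p_value

jb, p = jarque_bera(46, -0.183945, 2.875020)
print(round(jb, 4), round(p, 4))
```

Both numbers agree with the Jarque-Bera statistic and probability shown in Figure 5.2.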


5.6  Autocorrelation 

Violation of assumption 3 means that the disturbances are correlated, i.e., $E(u_iu_j) = \sigma_{ij} \neq 0$, for $i \neq j$, and $i, j = 1, 2, \dots, n$. Since $u_i$ has zero mean, $E(u_iu_j) = \text{cov}(u_i, u_j)$ and this is denoted by $\sigma_{ij}$. This correlation is more likely to occur in time-series than in cross-section studies. Consider estimating the consumption function of a random sample of households. An unexpected event, like a visit of family members, will increase the consumption of this household. However, this positive disturbance need not be correlated with the disturbances affecting consumption of other randomly drawn households. In contrast, if we were estimating this consumption function using aggregate time-series data for the U.S., then it is very likely that a recession year affecting consumption negatively this year may have a carry-over effect to the next few years. A shock to the economy like the oil embargo in 1973 is likely to affect the economy for several years. A labor strike this year may affect production for the next few years. Since autocorrelation is mainly a time-series problem, we will switch the $i$ and $j$ subscripts to $t$ and $s$ denoting time-series observations $t, s = 1, 2, \dots, T$ and the sample size will be denoted by $T$ rather than $n$. This covariance term is symmetric, so that $\sigma_{12} = E(u_1u_2) = E(u_2u_1) = \sigma_{21}$. Hence, only $T(T-1)/2$ distinct $\sigma_{ts}$’s have to be estimated. For example, if $T = 3$, then $\sigma_{12}$, $\sigma_{13}$ and $\sigma_{23}$ are the distinct covariance terms. However, it is hopeless to estimate $T(T-1)/2$ covariances ($\sigma_{ts}$) with only $T$ observations. Therefore, more structure on these $\sigma_{ts}$’s needs to be imposed. A popular assumption is that the $u_t$’s follow a first-order autoregressive process denoted by AR(1):


$$u_t = \rho u_{t-1} + \epsilon_t \qquad t = 1, 2, \dots, T \qquad (5.26)$$

where $\epsilon_t$ is IID$(0, \sigma_\epsilon^2)$. It is autoregressive because $u_t$ is related to its lagged value $u_{t-1}$. One can also write (5.26), for period $t-1$, as

$$u_{t-1} = \rho u_{t-2} + \epsilon_{t-1} \qquad (5.27)$$

and substitute (5.27) in (5.26) to get

$$u_t = \rho^2 u_{t-2} + \rho\epsilon_{t-1} + \epsilon_t \qquad (5.28)$$

Note that the power of $\rho$ and the subscript of $u$ or $\epsilon$ always sum to $t$. By continuous substitution of this form, one ultimately gets

$$u_t = \rho^t u_0 + \rho^{t-1}\epsilon_1 + \dots + \rho\epsilon_{t-1} + \epsilon_t \qquad (5.29)$$

This means that $u_t$ is a function of current and past values of $\epsilon_t$ and $u_0$, where $u_0$ is the initial value of $u_t$. If $u_0$ has zero mean, then $u_t$ has zero mean. This follows from (5.29) by taking expectations. Also, from (5.26)

$$\text{var}(u_t) = \rho^2\text{var}(u_{t-1}) + \text{var}(\epsilon_t) + 2\rho\,\text{cov}(u_{t-1}, \epsilon_t) \qquad (5.30)$$

Using (5.29), $u_{t-1}$ is a function of $\epsilon_{t-1}$, past values of $\epsilon_{t-1}$ and $u_0$. Since $u_0$ is independent of the $\epsilon$’s, and the $\epsilon$’s are themselves not serially correlated, then $u_{t-1}$ is independent of $\epsilon_t$. This means that $\text{cov}(u_{t-1}, \epsilon_t) = 0$. Furthermore, for $u_t$ to be homoskedastic, $\text{var}(u_t) = \text{var}(u_{t-1}) = \sigma_u^2$, and (5.30) reduces to $\sigma_u^2 = \rho^2\sigma_u^2 + \sigma_\epsilon^2$, which when solved for $\sigma_u^2$ gives:

$$\sigma_u^2 = \sigma_\epsilon^2/(1 - \rho^2) \qquad (5.31)$$

Hence, $u_0 \sim (0, \sigma_\epsilon^2/(1 - \rho^2))$ for the $u$’s to have zero mean and homoskedastic disturbances. Multiplying (5.26) by $u_{t-1}$ and taking expected values, one gets

$$E(u_tu_{t-1}) = \rho E(u_{t-1}^2) + E(u_{t-1}\epsilon_t) = \rho\sigma_u^2 \qquad (5.32)$$

since $E(u_{t-1}^2) = \sigma_u^2$ and $E(u_{t-1}\epsilon_t) = 0$. Therefore, $\text{cov}(u_t, u_{t-1}) = \rho\sigma_u^2$, and the correlation coefficient between $u_t$ and $u_{t-1}$ is $\text{correl}(u_t, u_{t-1}) = \text{cov}(u_t, u_{t-1})/\sqrt{\text{var}(u_t)\text{var}(u_{t-1})} = \rho\sigma_u^2/\sigma_u^2 = \rho$. Since $\rho$ is a correlation coefficient, this means that $-1 \le \rho \le 1$; for the variance in (5.31) to be finite and positive, one needs $|\rho| < 1$. In general, one can show that

$$\text{cov}(u_t, u_s) = \rho^{|t-s|}\sigma_u^2 \qquad t, s = 1, 2, \dots, T \qquad (5.33)$$

see problem 6. This means that the correlation between $u_t$ and $u_{t-r}$ is $\rho^r$, which is a fraction raised to an integer power, i.e., the correlation between the disturbances decays the further apart they are. This is reasonable in economics and may be the reason why this autoregressive form (5.26) is so popular. One should note that this is not the only form that would correlate the disturbances across time. In Chapter 14, we will consider other forms like the Moving Average (MA) process, and higher order Autoregressive Moving Average (ARMA) processes, but these are beyond the scope of this chapter.

Consequences for OLS

How is the OLS estimator affected by the violation of the no autocorrelation assumption among the disturbances? The OLS estimator is still unbiased and consistent since these properties rely on assumptions 1 and 4 and have nothing to do with assumption 3. For the simple linear regression, using (5.2), the variance of $\hat\beta_{OLS}$ is now

$$\text{var}(\hat\beta_{OLS}) = \text{var}\Big(\sum_{t=1}^T w_tu_t\Big) = \sum_{t=1}^T\sum_{s=1}^T w_tw_s\,\text{cov}(u_t, u_s) = \sigma_u^2\Big/\sum_{t=1}^T x_t^2 + \sum\sum_{t \neq s} w_tw_s\rho^{|t-s|}\sigma_u^2 \qquad (5.34)$$

where $w_t = x_t/\sum_{t=1}^T x_t^2$ and $\text{cov}(u_t, u_s) = \rho^{|t-s|}\sigma_u^2$ as explained in (5.33). Note that the first term in (5.34) is the usual variance of $\hat\beta_{OLS}$ under the classical case. The second term in (5.34) arises because of the correlation between the $u_t$’s. Hence, the variance of OLS computed from a regression package, i.e., $s^2/\sum_{t=1}^T x_t^2$, is a wrong estimate of the variance of $\hat\beta_{OLS}$ for two reasons. First, it is using the wrong formula for the variance, i.e., $\sigma_u^2/\sum_{t=1}^T x_t^2$ rather than (5.34). The latter depends on $\rho$ through the extra term in (5.34). Second, one can show, see problem 7, that $E(s^2) \neq \sigma_u^2$ and will involve $\rho$ as well as $\sigma_u^2$. Hence, $s^2$ is not unbiased for $\sigma_u^2$ and $s^2/\sum_{t=1}^T x_t^2$ is a biased estimate of $\text{var}(\hat\beta_{OLS})$. The direction and magnitude of this bias depend on $\rho$ and the regressor. In fact, if $\rho$ is positive, and the $x_t$’s are themselves positively autocorrelated, then $s^2/\sum_{t=1}^T x_t^2$ understates the true variance of $\hat\beta_{OLS}$. This means that the confidence interval for $\beta$ is tighter than it should be and the t-statistic for $\beta = 0$ is overblown, see problem 8. As in the heteroskedastic case, but for completely different reasons, any inference based on $\text{var}(\hat\beta_{OLS})$ reported from the standard regression packages will be misleading if the $u_t$’s are serially correlated.
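To see the size of the extra term in (5.34), one can evaluate both variance formulas for a given regressor and a given $\rho$. The sketch below assumes AR(1) disturbances with known $\rho$ and $\sigma_\epsilon^2$, uses the covariances in (5.33), and takes a hypothetical trending regressor as the example.

```python
def ols_slope_variances(x, rho, sigma2_eps=1.0):
    """Return (classical, true) variance of the OLS slope under AR(1) errors."""
    T = len(x)
    xbar = sum(x) / T
    dev = [xi - xbar for xi in x]                   # x_t in deviation form
    sxx = sum(d ** 2 for d in dev)
    w = [d / sxx for d in dev]                      # OLS weights w_t
    sigma2_u = sigma2_eps / (1.0 - rho ** 2)        # equation (5.31)
    classical = sigma2_u / sxx                      # first term of (5.34)
    true = sum(w[t] * w[s] * rho ** abs(t - s) * sigma2_u   # (5.33) in (5.34)
               for t in range(T) for s in range(T))
    return classical, true

trend = list(range(1, 11))                          # hypothetical trending regressor
for rho in (0.0, 0.5, 0.9):
    classical, true = ols_slope_variances(trend, rho)
    print(rho, round(classical, 5), round(true, 5))
```

For a trending regressor and positive $\rho$ the true variance exceeds the classical formula, which is the understatement discussed above. This comparison holds $\sigma_u^2$ fixed and so isolates only the first of the two sources of bias.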

Newey and West (1987) suggested a simple heteroskedasticity and autocorrelation-consistent covariance matrix for the OLS estimator without specifying the functional form of the serial correlation. The basic idea extends White’s (1980) replacement of heteroskedastic variances with squared OLS residuals $e_t^2$ by additionally including products of least squares residuals $e_te_{t-s}$ for $s = 0, \pm 1, \dots, \pm p$, where $p$ is the maximum order of serial correlation we are willing to assume. The consistency of this procedure relies on $p$ being very small relative to the number of observations $T$. This is consistent with popular serial correlation specifications considered in this chapter where the autocorrelation dies out quickly as the lag increases. Newey and West (1987) allow the higher order covariance terms to receive diminishing weights. This Newey-West option for the least squares estimator is available using EViews. Andrews (1991) warns about the unreliability of such standard error corrections in some circumstances. Wooldridge (1991) shows that it is possible to construct serial correlation-robust F-statistics for testing joint hypotheses as considered in Chapter 4. However, these are beyond the scope of this book.

Is OLS still BLUE? In order to determine the BLU estimator in this case, we lag the regression equation once, multiply it by $\rho$, and subtract it from the original regression equation to get

$$Y_t - \rho Y_{t-1} = \alpha(1 - \rho) + \beta(X_t - \rho X_{t-1}) + \epsilon_t \qquad t = 2, 3, \dots, T \qquad (5.35)$$

This transformation, known as the Cochrane-Orcutt (1949) transformation, reduces the disturbances to classical errors. Therefore, OLS on the resulting regression renders the estimates BLU, i.e., run $\tilde Y_t = Y_t - \rho Y_{t-1}$ on a constant and $\tilde X_t = X_t - \rho X_{t-1}$, for $t = 2, 3, \dots, T$. Note that we have lost one observation by lagging, and the resulting estimators are BLUE only for linear combinations of $(T - 1)$ observations in $Y$.¹ Prais and Winsten (1954) derive the BLU estimators for linear combinations of $T$ observations in $Y$. This entails recapturing the initial observation as follows: (i) Multiply the first observation of the regression equation by $\sqrt{1 - \rho^2}$;

$$\sqrt{1 - \rho^2}\,Y_1 = \alpha\sqrt{1 - \rho^2} + \beta\sqrt{1 - \rho^2}\,X_1 + \sqrt{1 - \rho^2}\,u_1$$

(ii) add this transformed initial observation to the Cochrane-Orcutt transformed observations for $t = 2, \dots, T$ and run the regression on the $T$ observations rather than the $(T - 1)$ observations. See Chapter 9 for a formal proof of this result. Note that

$$\tilde Y_1 = \sqrt{1 - \rho^2}\,Y_1 \qquad \text{and} \qquad \tilde Y_t = Y_t - \rho Y_{t-1} \quad \text{for } t = 2, \dots, T$$

Similarly, $\tilde X_1 = \sqrt{1 - \rho^2}\,X_1$ and $\tilde X_t = X_t - \rho X_{t-1}$ for $t = 2, \dots, T$. The constant variable $C_t = 1$ for $t = 1, \dots, T$ is now a new variable $\tilde C_t$ which takes the values $\tilde C_1 = \sqrt{1 - \rho^2}$ and $\tilde C_t = (1 - \rho)$ for $t = 2, \dots, T$. Hence, the Prais-Winsten procedure is the regression of $\tilde Y_t$ on $\tilde C_t$ and $\tilde X_t$ without a constant. It is obvious that the resulting BLU estimators will involve $\rho$ and are, therefore, different from the usual OLS estimators except in the case where $\rho = 0$. Hence, OLS is no longer BLUE. Furthermore, we need to know $\rho$ in order to obtain the BLU estimators. In applied work, $\rho$ is not known and has to be estimated, in which case the Prais-Winsten regression is no longer BLUE since it is based on an estimate of $\rho$ rather than the true $\rho$ itself. However, as long as $\hat\rho$ is a consistent estimate of $\rho$, this is a sufficient condition for the corresponding estimates of $\alpha$ and $\beta$ in the next step to be asymptotically efficient, see Chapter 9. We now turn to various methods of estimating $\rho$.
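The Prais-Winsten transformation and the regression without a constant can be sketched as follows, taking $\rho$ as given. The function names and the data are illustrative assumptions; in the example the data satisfy $Y_t = 1 + 2X_t$ exactly, so the regression should return $\alpha = 1$ and $\beta = 2$ for any $|\rho| < 1$.

```python
import math

def prais_winsten_transform(y, x, rho):
    """Return (y_tilde, c_tilde, x_tilde) for a given rho with |rho| < 1."""
    scale = math.sqrt(1.0 - rho ** 2)
    y_t = [scale * y[0]] + [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    x_t = [scale * x[0]] + [x[t] - rho * x[t - 1] for t in range(1, len(x))]
    c_t = [scale] + [1.0 - rho] * (len(y) - 1)      # transformed constant
    return y_t, c_t, x_t

def ols_no_constant(y, c, x):
    """OLS of y on the two regressors c and x, with no additional intercept."""
    scc = sum(ci * ci for ci in c)
    scx = sum(ci * xi for ci, xi in zip(c, x))
    sxx = sum(xi * xi for xi in x)
    scy = sum(ci * yi for ci, yi in zip(c, y))
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = scc * sxx - scx ** 2
    alpha = (scy * sxx - scx * sxy) / det
    beta = (scc * sxy - scx * scy) / det
    return alpha, beta

x = [1.0, 3.0, 2.0, 5.0, 4.0]                       # hypothetical regressor
y = [1.0 + 2.0 * xi for xi in x]                    # exact line, no disturbance
alpha, beta = ols_no_constant(*prais_winsten_transform(y, x, 0.5))
print(alpha, beta)
```

The coefficient on the transformed constant is $\alpha$ itself, so no rescaling by $(1 - \rho)$ is needed, unlike the intercept of (5.35).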

(1) The Cochrane-Orcutt (1949) Method: This method starts with an initial estimate of $\rho$, the most convenient being 0, and minimizes the residual sum of squares in (5.35). This gives us the OLS estimates of $\alpha$ and $\beta$. Then we substitute $\hat\alpha_{OLS}$ and $\hat\beta_{OLS}$ in (5.35) and we get

$$e_t = \rho e_{t-1} + \epsilon_t \qquad t = 2, \dots, T \qquad (5.36)$$

where $e_t$ denotes the OLS residual. An estimate of $\rho$ can be obtained by minimizing the residual sum of squares in (5.36), or running the regression of $e_t$ on $e_{t-1}$ without a constant. The resulting estimate of $\rho$ is $\hat\rho_{CO} = \sum_{t=2}^T e_te_{t-1}/\sum_{t=2}^T e_{t-1}^2$, where both summations run over $t = 2, 3, \dots, T$. The second step of the Cochrane-Orcutt procedure (2SCO) is to perform the regression in (5.35) with $\hat\rho_{CO}$ instead of $\rho$. One can iterate this procedure (ITCO) by computing new residuals based on the new estimates of $\alpha$ and $\beta$ and hence a new estimate of $\rho$ from (5.36), and so on, until convergence. Both the 2SCO and the ITCO are asymptotically efficient; the argument for iterating must be justified in terms of small sample gains.
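A minimal sketch of the two-step and iterative Cochrane-Orcutt procedures for a simple regression is given below. The simulated data, the seed, the convergence tolerance and the function names are all assumptions made for illustration.

```python
import random

def ols(x, y):
    """Intercept and slope from OLS of y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
             / sum((a - xbar) ** 2 for a in x))
    return ybar - slope * xbar, slope

def cochrane_orcutt(x, y, max_iter=50, tol=1e-8):
    """Iterative Cochrane-Orcutt; max_iter=1 gives the two-step version."""
    alpha, beta = ols(x, y)                  # start from rho = 0, i.e. plain OLS
    rho = 0.0
    for _ in range(max_iter):
        e = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
        new_rho = (sum(e[t] * e[t - 1] for t in range(1, len(e)))
                   / sum(e[t - 1] ** 2 for t in range(1, len(e))))  # from (5.36)
        ys = [y[t] - new_rho * y[t - 1] for t in range(1, len(y))]
        xs = [x[t] - new_rho * x[t - 1] for t in range(1, len(x))]
        intercept, beta = ols(xs, ys)        # regression (5.35), T - 1 observations
        alpha = intercept / (1.0 - new_rho)  # intercept of (5.35) is alpha*(1 - rho)
        converged = abs(new_rho - rho) < tol
        rho = new_rho
        if converged:
            break
    return alpha, beta, rho

# Simulated data with alpha = 1, beta = 2 and rho = 0.7 (illustration only)
random.seed(0)
T, true_rho = 500, 0.7
x = [random.gauss(0, 1) for _ in range(T)]
u, y = 0.0, []
for t in range(T):
    u = true_rho * u + random.gauss(0, 1)
    y.append(1.0 + 2.0 * x[t] + u)
print(cochrane_orcutt(x, y, max_iter=1))     # 2SCO
print(cochrane_orcutt(x, y))                 # ITCO
```

Note that this version drops the first observation, as in Cochrane and Orcutt; recapturing it would require the Prais-Winsten transformation.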

(2) The Hildreth-Lu (1960) Search Procedure: $\rho$ is between $-1$ and 1. Therefore, this procedure searches over this range, i.e., using values of $\rho$ say between $-0.9$ and 0.9 in intervals of 0.1. For each $\rho$, one computes the regression in (5.35) and reports the residual sum of squares corresponding to that $\rho$. The minimum residual sum of squares gives us our choice of $\rho$ and the corresponding regression gives us the estimates of $\alpha$, $\beta$ and $\sigma_\epsilon^2$. One can refine this procedure around the best $\rho$ found in the first stage of the search. For example, suppose that $\rho = 0.6$ gave the minimum residual sum of squares; one can search next between 0.51 and 0.69 in intervals of 0.01. This search procedure guards against a local minimum. Since the likelihood in this case contains $\rho$ as well as $\sigma_\epsilon^2$, $\alpha$ and $\beta$, this search procedure can be modified to maximize the likelihood rather than minimize the residual sum of squares, since the two criteria are no longer equivalent. The maximum value of the likelihood will give our choice of $\rho$ and the corresponding estimates of $\alpha$, $\beta$ and $\sigma_\epsilon^2$.
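The grid search can be sketched as below for a simple regression. The example data are hypothetical and deliberately noise-free: the disturbance follows $u_t = 0.6u_{t-1}$ exactly, so the residual sum of squares of (5.35) is zero at $\rho = 0.6$ and the search should return that value.

```python
def rss_for_rho(x, y, rho):
    """Residual sum of squares of regression (5.35) for a given rho."""
    ys = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    xs = [x[t] - rho * x[t - 1] for t in range(1, len(x))]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((a - xbar) ** 2 for a in xs)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    syy = sum((b - ybar) ** 2 for b in ys)
    return syy - sxy ** 2 / sxx

def hildreth_lu(x, y, lo=-0.9, hi=0.9, step=0.1):
    """Grid search for the rho that minimizes the residual sum of squares."""
    n_points = int(round((hi - lo) / step)) + 1
    grid = [round(lo + k * step, 10) for k in range(n_points)]
    return min(grid, key=lambda r: rss_for_rho(x, y, r))

x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]        # hypothetical regressor
u = [5.0 * 0.6 ** t for t in range(len(x))]         # noise-free AR(1) path
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]

coarse = hildreth_lu(x, y)                          # first stage, step 0.1
fine = hildreth_lu(x, y, coarse - 0.09, coarse + 0.09, 0.01)   # refinement
print(coarse, fine)
```

With real data the residual sum of squares is not zero anywhere on the grid, and the refinement stage can move the estimate by up to 0.09 in either direction.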

(3) Durbin’s (1960) Method: One can rearrange (5.35) by moving $Y_{t-1}$ to the right hand side, i.e.,

$$Y_t = \rho Y_{t-1} + \alpha(1 - \rho) + \beta X_t - \rho\beta X_{t-1} + \epsilon_t \qquad (5.37)$$

and running OLS on (5.37). The error in (5.37) is classical, and the presence of $Y_{t-1}$ on the right hand side reminds us of the contemporaneously uncorrelated case discussed under the violation of assumption 4. For that violation, we have shown that unbiasedness is lost, but not consistency. Hence, the estimate of $\rho$ as a coefficient of $Y_{t-1}$ is biased but consistent. This is the Durbin estimate of $\rho$, call it $\hat\rho_D$. Next, the second step of the Cochrane-Orcutt procedure is performed using this estimate of $\rho$.
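Durbin’s first step is an ordinary multiple regression, so it can be sketched with the normal equations. The linear solver below is a small hand-written helper, and the data are a hypothetical noise-free example in which $u_t = 0.6u_{t-1}$ exactly, so that (5.37) holds without error and the coefficient on $Y_{t-1}$ should equal 0.6.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def durbin_rho(x, y):
    """Coefficient on Y_{t-1} from OLS on (5.37)."""
    rows = [[y[t - 1], 1.0, x[t], x[t - 1]] for t in range(1, len(y))]
    target = [y[t] for t in range(1, len(y))]
    k = 4
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * v for r, v in zip(rows, target)) for i in range(k)]
    return solve(XtX, Xty)[0]

x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]        # hypothetical regressor
u = [5.0 * 0.6 ** t for t in range(len(x))]         # noise-free AR(1) path
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]
print(durbin_rho(x, y))
```

The regression in (5.37) is run unrestricted, i.e., the coefficient on $X_{t-1}$ is not forced to equal $-\rho\beta$; only the coefficient on $Y_{t-1}$ is carried to the second step.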

(4) Beach-MacKinnon (1978) Maximum Likelihood Procedure: Beach and MacKinnon (1978) derived a cubic equation in $\rho$ which maximizes the likelihood function concentrated with respect to $\alpha$, $\beta$, and $\sigma_\epsilon^2$. With this estimate of $\rho$, denoted by $\hat\rho_{BM}$, one performs the Prais-Winsten procedure in the next step.

Correcting for serial correlation is not without its critics. Mizon (1995) argues this point forcefully in his article entitled “A simple message for autocorrelation correctors: Don’t.” The main point is that serial correlation is a symptom of dynamic misspecification, which is better represented using a general unrestricted dynamic specification.

Monte Carlo Results

Rao and Griliches (1969) performed a Monte Carlo study using an autoregressive $X_t$, and various values of $\rho$. They found that OLS is still a viable estimator as long as $|\rho| < 0.3$, but if $|\rho| > 0.3$, then it pays to perform procedures that correct for serial correlation based on an estimator of $\rho$. Their recommendation was to compute Durbin’s estimate of $\rho$ in the first step and to do the Prais-Winsten procedure in the second step. Maeshiro (1976, 1979) found that if the $X_t$ series is trended, which is usual with economic data, then OLS outperforms 2SCO, but not the two-step Prais-Winsten (2SPW) procedure that recaptures the initial observation. These results were confirmed by Park and Mitchell (1980), who performed an extensive Monte Carlo study using trended and untrended $X_t$’s. Their basic findings include the following: (i) For trended $X_t$’s, OLS beats 2SCO, ITCO and even a Cochrane-Orcutt procedure that is based on the true $\rho$. However, OLS was beaten by 2SPW, iterative Prais-Winsten (ITPW), and Beach-MacKinnon (BM). Their conclusion is that one should not use regressions based on $(T - 1)$ observations as in Cochrane and Orcutt. (ii) The ITPW procedure is the recommended estimator, beating 2SPW and BM for high values of the true $\rho$, for both trended and nontrended $X_t$’s. (iii) Tests of hypotheses regarding the regression coefficients performed miserably for all estimators based on an estimator of $\rho$. The results indicated less bias in standard error estimation for ITPW, BM and 2SPW than OLS. However, the tests based on these standard errors still led to a high probability of type I error for all estimation procedures.

Testing for Autocorrelation

So  far,  we  have  studied  the  properties  of  OLS  under  the  violation  of  assumption  3.  We  have 
derived  asymptotically  efficient  estimators  of  the  coefficients  based  on  consistent  estimators  of 
p  and  studied  their  small  sample  properties  using  Monte  Carlo  experiments.  Next,  we  focus  on 
the  problem  of  detecting  this  autocorrelation  between  the  disturbances.  A  popular  diagnostic 
for  detecting  such  autocorrelation  is  the  Durbin  and  Watson  (1951)  statistic2 

d  =  £L2(e*  -  et— i)2/  £L  4  (5-38) 

If this was based on the true $u_t$'s and $T$ was very large, then $d$ can be shown to tend in the limit
as $T$ gets large to $2(1-\rho)$, see problem 9. This means that if $\rho \to 0$, then $d \to 2$; if $\rho \to 1$,
then $d \to 0$; and if $\rho \to -1$, then $d \to 4$. Therefore, a test for $H_0$; $\rho = 0$ can be based on
whether $d$ is close to 2 or not. Unfortunately, the critical values of $d$ depend upon the $X_t$'s, and
these vary from one data set to another. To get around this, Durbin and Watson established
upper ($d_U$) and lower ($d_L$) bounds for this critical value. If the observed $d$ is less than $d_L$, or
larger than $4 - d_L$, we reject $H_0$. If the observed $d$ is between $d_U$ and $4 - d_U$, then we do not
reject $H_0$. If $d$ lies in either of the two indeterminate regions, then one should compute the exact
critical values, which depend on the data. Most regression packages report the Durbin-Watson
statistic, but few give the exact p-value for this d-statistic. If one is interested in a one-sided
test, say $H_0$; $\rho = 0$ versus $H_1$; $\rho > 0$, then one would reject $H_0$ if $d < d_L$, and not reject $H_0$ if
$d > d_U$. If $d_L < d < d_U$, then the test is inconclusive. Similarly, for testing $H_0$; $\rho = 0$ versus
$H_1$; $\rho < 0$, one computes $(4 - d)$ and follows the steps for testing against positive autocorrelation.
Durbin and Watson tables for $d_L$ and $d_U$ covered sample sizes from 15 to 100 and a maximum
of 5 regressors. Savin and White (1977) extended these tables for $6 < T < 200$ and up to 10
regressors.
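
The statistic in (5.38), its large-sample limit $2(1-\rho)$, and the two-sided bounds rule described above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function names are hypothetical, the disturbances are simulated from a Gaussian AR(1), and the upper bound of 1.6 in the last call is a placeholder rather than a tabulated value.

```python
import random


def durbin_watson(e):
    """d = sum_{t=2}^T (e_t - e_{t-1})^2 / sum_{t=1}^T e_t^2, as in (5.38)."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(v * v for v in e)
    return num / den


def dw_decision(d, d_lower, d_upper):
    """Two-sided bounds test of H0: rho = 0 using tabulated bounds d_L and d_U."""
    if d < d_lower or d > 4 - d_lower:
        return "reject H0"
    if d_upper < d < 4 - d_upper:
        return "do not reject H0"
    return "inconclusive"


def simulate_ar1(rho, T, seed=0):
    """Draw u_t = rho * u_{t-1} + eps_t, eps_t ~ N(0, 1), with a stationary start."""
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) / (1 - rho ** 2) ** 0.5]
    for _ in range(T - 1):
        u.append(rho * u[-1] + rng.gauss(0, 1))
    return u


# With a long series of true disturbances, d should be close to 2(1 - rho).
for rho in (-0.5, 0.0, 0.7):
    d = durbin_watson(simulate_ar1(rho, T=20000))
    print(f"rho = {rho:+.1f}   d = {d:.3f}   2(1 - rho) = {2 * (1 - rho):.3f}")

# d_upper = 1.6 is an illustrative placeholder, not a value from the tables.
print(dw_decision(0.181, d_lower=1.497, d_upper=1.6))
```

In practice $d$ is computed from OLS residuals rather than the true disturbances, which is why the bounds $d_L$ and $d_U$ (or exact critical values) are needed.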

The Durbin-Watson statistic has several limitations. We discussed the inconclusive region and
the computation of exact critical values. The Durbin-Watson statistic is appropriate when there
is a constant in the regression. In case there is no constant in the regression, see Farebrother
(1980). Also, the Durbin-Watson statistic is inappropriate when there are lagged values of the
dependent variable among the regressors. We now turn to an alternative test for serial correlation
that does not have these limitations and that is also easy to apply. This test was derived by
Breusch (1978) and Godfrey (1978) and is known as the Breusch-Godfrey test for zero first-order
serial correlation. This is a Lagrange Multiplier test that amounts to running the regression of
the OLS residuals $e_t$ on $e_{t-1}$ and the original regressors in the model. The test statistic is $TR^2$.
Its distribution under the null is $\chi^2_1$. In this case, the regressors are a constant and $X_t$, and the
test checks whether the coefficient of $e_{t-1}$ is significant. The beauty of this test is that (i) it
is the same test for first-order serial correlation, whether the disturbances are Moving Average
of order one MA(1) or AR(1). (ii) This test is easily generalizable to higher autoregressive or
Moving Average schemes. For second-order serial correlation, like MA(2) or AR(2), one includes
two lags of the residuals on the right hand side; i.e., both $e_{t-1}$ and $e_{t-2}$. (iii) This test is still
valid even when lagged values of the dependent variable are present among the regressors, see
Chapter 6. The Breusch and Godfrey test is standard in EViews, which prompts the user
with a choice of the number of lags of the residuals to include among the regressors to test
for serial correlation. You click on residuals, then tests, and choose Breusch-Godfrey. Next, you
input the number of lagged residuals you want to include.
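
The auxiliary regression behind the Breusch-Godfrey test can be sketched in a few lines of Python. This is a minimal illustration, not the EViews routine: the helper names `ols`, `breusch_godfrey` and `simulate` are hypothetical, the data are simulated, and presample lagged residuals are set to zero (an assumption that mirrors the note printed in Table 5.4).

```python
import random


def ols(X, y):
    """OLS via the normal equations X'X b = X'y, solved by Gauss-Jordan elimination.
    X is a list of rows (include a 1.0 for the constant). Returns (beta, residuals)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    return beta, resid


def breusch_godfrey(X, y, lags=1):
    """LM statistic T * R^2 from regressing the OLS residuals on the original
    regressors and `lags` lagged residuals (presample lags set to zero)."""
    _, e = ols(X, y)
    T = len(e)
    Z = [list(X[t]) + [e[t - j] if t - j >= 0 else 0.0 for j in range(1, lags + 1)]
         for t in range(T)]
    _, v = ols(Z, e)
    mean_e = sum(e) / T
    tss = sum((a - mean_e) ** 2 for a in e)
    r2 = 1.0 - sum(a * a for a in v) / tss
    return T * r2


def simulate(rho, T=200, seed=1):
    """y_t = 1 + 2 x_t + u_t with AR(1) disturbances u_t = rho * u_{t-1} + eps_t."""
    rng = random.Random(seed)
    u, X, y = 0.0, [], []
    for _ in range(T):
        u = rho * u + rng.gauss(0, 1)
        x = rng.gauss(0, 1)
        X.append([1.0, x])
        y.append(1.0 + 2.0 * x + u)
    return X, y


X, y = simulate(rho=0.8)
lm1 = breusch_godfrey(X, y, lags=1)  # compare with the 5% chi-square(1) value, 3.84
lm2 = breusch_godfrey(X, y, lags=2)  # compare with the 5% chi-square(2) value, 5.99
print(f"LM(1) = {lm1:.2f}, LM(2) = {lm2:.2f}")
```

Moving from first-order to second-order serial correlation only changes the `lags` argument and the degrees of freedom of the $\chi^2$ reference distribution.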
What about first differencing the data as a possible solution for getting rid of serial correlation?
Some economic behavioral equations are specified with variables in first difference form,
like GDP growth, but other equations are first differenced for estimation purposes. In the latter
case, if the original disturbances were not autocorrelated (or even correlated, with $\rho \neq 1$), then
the transformed disturbances are serially correlated. After all, first differencing the disturbances
is equivalent to setting $\rho = 1$ in $u_t - \rho u_{t-1}$, and this new disturbance $u_t^* = u_t - u_{t-1}$ has $u_{t-1}$ in
common with $u_{t-1}^* = u_{t-1} - u_{t-2}$, making $E(u_t^* u_{t-1}^*) = -E(u_{t-1}^2) = -\sigma_u^2$. However, one could
argue that if $\rho$ is large and positive, first differencing the data may not be a bad solution. Rao
and Miller (1971) calculated the variance of the BLU estimator correcting for serial correlation,
for various guesses of $\rho$. They assume a true $\rho$ of 0.2, and an autoregressive $X_t$

$$X_t = \lambda X_{t-1} + w_t \quad \text{with } \lambda = 0, 0.4, 0.8. \qquad (5.39)$$

They find that OLS (or a guess of $\rho = 0$) performs better than first differencing the data,
and is pretty close in terms of efficiency to the true BLU estimator for trended $X_t$ ($\lambda = 0.8$).
However, the performance of OLS deteriorates as $\lambda$ declines to 0.4 and 0, with respect to the true
BLU estimator. This supports the Monte Carlo finding by Rao and Griliches that for $|\rho| < 0.3$,
OLS performs reasonably well relative to estimators that correct for serial correlation. However,
the first-difference estimator, i.e., a guess of $\rho = 1$, performs badly for trended $X_t$ ($\lambda = 0.8$),
giving the worst efficiency when compared to any other guess of $\rho$. Only when the $X_t$'s are
less trended ($\lambda = 0.4$) or random ($\lambda = 0$) does the efficiency of the first-difference estimator
improve. However, even for those cases one can do better by guessing $\rho$. For example, for $\lambda = 0$,
one can always do better than first differencing by guessing any positive $\rho$ less than 1. Similarly,
for true $\rho = 0.6$, a higher degree of serial correlation, Rao and Miller (1971) show that the
performance of OLS deteriorates, while that of the first difference improves. However, one can
still do better than first differencing by guessing in the interval (0.4, 0.9). This gain in efficiency
increases with trended $X_t$'s.
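
The earlier claim that first differencing induces serial correlation unless $\rho = 1$ can be checked with a short derivation. This is a sketch under the added assumption that $u_t$ is a stationary AR(1) process with autocovariances $\gamma_s = \rho^s \sigma_u^2$; the symbol $\gamma_s$ is introduced here only for illustration.

```latex
\begin{align*}
u_t^* &= u_t - u_{t-1}, \\
\operatorname{var}(u_t^*) &= 2\gamma_0 - 2\gamma_1 = 2\sigma_u^2(1-\rho), \\
E(u_t^* u_{t-1}^*) &= 2\gamma_1 - \gamma_0 - \gamma_2
                    = -\sigma_u^2(1 - 2\rho + \rho^2) = -\sigma_u^2(1-\rho)^2, \\
\operatorname{corr}(u_t^*, u_{t-1}^*) &= -\tfrac{1}{2}(1-\rho).
\end{align*}
```

With $\rho = 0$ the covariance is $-\sigma_u^2$ and the first-order correlation of the differenced disturbances is $-0.5$. The induced correlation shrinks toward zero only as $\rho$ approaches 1, which is why differencing is defensible only when $\rho$ is large and positive.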

Empirical Example: Table 5.3 gives the U.S. Real Personal Consumption Expenditures ($C$)
and Real Disposable Personal Income ($Y$) from the Economic Report of the President over the
period 1959-2007. This data set is available as CONSUMP.DAT on the Springer web site.

The OLS regression yields:

$$C_t = \underset{(219.56)}{-1343.31} + \underset{(0.011)}{0.979}\, Y_t + \text{residuals}$$

Figure 5.3 plots the actual, fitted and residuals using EViews 6.0. This shows positive serial
correlation with a string of positive residuals followed by a string of negative residuals followed
by positive residuals. The Durbin-Watson statistic is $d = 0.181$, which is much smaller than the
lower bound $d_L = 1.497$ for $T = 49$ and one regressor. Therefore, we reject the null hypothesis
of $H_0$; $\rho = 0$ at the 5% significance level.

The Breusch (1978) and Godfrey (1978) regression that tests for first-order serial correlation
is given in Table 5.4. This is done using EViews 6.0.

This yields

$$e_t = \underset{(102.77)}{-54.41} + \underset{(0.005)}{0.004}\, Y_t + \underset{(0.070)}{0.909}\, e_{t-1} + \text{residuals}$$

The test statistic is $TR^2$, which yields $49 \times (0.786) = 38.5$. This is distributed as $\chi^2_1$ under $H_0$;
$\rho = 0$. This rejects the null hypothesis of no first-order serial correlation with a p-value of 0.0000
shown in Table 5.4.
Table 5.3  U.S. Consumption Data, 1959-2007

C = Real Personal Consumption Expenditures (in 1987 dollars)
Y = Real Disposable Personal Income (in 1987 dollars)

YEAR       C       Y        YEAR       C       Y
1959    8776    9685        1984   16343   19011
1960    8837    9735        1985   17040   19476
1961    8873    9901        1986   17570   19906
1962    9170   10227        1987   17994   20072
1963    9412   10455        1988   18554   20740
1964    9839   11061        1989   18898   21120
1965   10331   11594        1990   19067   21281
1966   10793   12065        1991   18848   21109
1967   10994   12457        1992   19208   21548
1968   11510   12892        1993   19593   21493
1969   11820   13163        1994   20082   21812
1970   11955   13563        1995   20382   22153
1971   12256   14001        1996   20835   22546
1972   12868   14512        1997   21365   23065
1973   13371   15345        1998   22183   24131
1974   13148   15094        1999   23050   24564
1975   13320   15291        2000   23862   25472
1976   13919   15738        2001   24215   25697
1977   14364   16128        2002   24632   26238
1978   14837   16704        2003   25073   26566
1979   15030   16931        2004   25750   27274
1980   14816   16940        2005   26290   27403
1981   14879   17217        2006   26835   28098
1982   14944   17418        2007   27319   28614
1983   15656   17828

Source: Economic Report of the President


Regressing the OLS residuals on their lagged values yields

$$e_t = \underset{(0.062)}{0.906}\, e_{t-1} + \text{residuals}$$

The two-step Cochrane-Orcutt (1949) procedure based on $\hat\rho = 0.906$ using Stata 11 yields the
results given in Table 5.5.

The Prais-Winsten (1954) procedure using Stata 11 yields the results given in Table 5.6. The
estimate of the marginal propensity to consume is 0.979 for OLS, 0.989 for two-step
Cochrane-Orcutt, and 0.912 for iterative Prais-Winsten. All of these estimates are significant.

The Newey-West heteroskedasticity and autocorrelation-consistent standard errors for least
squares with a three-year lag truncation are given in Table 5.7 using EViews 6. Note that both
standard errors are now larger than those reported by least squares. But once again, this is not
necessarily the case for other data sets.
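
The main numbers in this example can be approximately reproduced from Table 5.3 without a regression package. The Python sketch below is not the EViews or Stata code used in the text. It hard-codes the two series, assuming that the smaller series is $C$ (its mean of 16749.10 matches the "Mean dependent var" reported in Table 5.7), and the helper name `simple_ols` is illustrative. It also assumes that $\hat\rho$ comes from regressing $e_t$ on $e_{t-1}$ without a constant and that the two-step Cochrane-Orcutt regression drops the first observation.

```python
# Table 5.3, 1959-2007: C = consumption, Y = disposable income (assumed column order).
C = [8776, 8837, 8873, 9170, 9412, 9839, 10331, 10793, 10994, 11510,
     11820, 11955, 12256, 12868, 13371, 13148, 13320, 13919, 14364, 14837,
     15030, 14816, 14879, 14944, 15656, 16343, 17040, 17570, 17994, 18554,
     18898, 19067, 18848, 19208, 19593, 20082, 20382, 20835, 21365, 22183,
     23050, 23862, 24215, 24632, 25073, 25750, 26290, 26835, 27319]
Y = [9685, 9735, 9901, 10227, 10455, 11061, 11594, 12065, 12457, 12892,
     13163, 13563, 14001, 14512, 15345, 15094, 15291, 15738, 16128, 16704,
     16931, 16940, 17217, 17418, 17828, 19011, 19476, 19906, 20072, 20740,
     21120, 21281, 21109, 21548, 21493, 21812, 22153, 22546, 23065, 24131,
     24564, 25472, 25697, 26238, 26566, 27274, 27403, 28098, 28614]


def simple_ols(x, y):
    """Intercept and slope from regressing y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Step 1: OLS of C on a constant and Y, and the OLS residuals.
a, b = simple_ols(Y, C)
e = [c - a - b * y for c, y in zip(C, Y)]
T = len(e)

# Durbin-Watson statistic (5.38).
d = sum((e[t] - e[t - 1]) ** 2 for t in range(1, T)) / sum(v * v for v in e)

# Step 2: rho-hat from regressing e_t on e_{t-1} with no constant.
rho = sum(e[t] * e[t - 1] for t in range(1, T)) / sum(e[t - 1] ** 2 for t in range(1, T))

# Step 3: Cochrane-Orcutt quasi-differences on observations 2..T, then OLS again.
C_star = [C[t] - rho * C[t - 1] for t in range(1, T)]
Y_star = [Y[t] - rho * Y[t - 1] for t in range(1, T)]
a_star, b_co = simple_ols(Y_star, C_star)
a_co = a_star / (1 - rho)  # intercept of the original equation

print(f"OLS: intercept = {a:.2f}, slope = {b:.3f}, DW = {d:.3f}")
print(f"rho-hat = {rho:.3f}")
print(f"Two-step Cochrane-Orcutt: intercept = {a_co:.2f}, slope = {b_co:.3f}")
```

If the data are transcribed correctly, the output should be close to the values reported in the text and in Tables 5.5 and 5.7. Small differences from the packaged routines would not be surprising.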
Figure 5.3  Residual Plot: Consumption Regression


Table 5.4  Breusch-Godfrey LM Test

F-statistic        168.9023     Prob. F(1,46)          0.0000
Obs*R-squared      38.51151     Prob. Chi-Square(1)    0.0000

Test Equation:
Dependent Variable: RESID
Method: Least Squares
Sample: 1959 2007
Included observations: 49
Presample missing value lagged residuals set to zero

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C               -54.41017      102.7650      -0.529462     0.5990
Y                0.003590      0.005335       0.673044     0.5043
RESID(-1)        0.909272      0.069964      12.99624      0.0000

R-squared             0.785949     Mean dependent var       -5.34E-13
Adjusted R-squared    0.776643     S.D. dependent var        433.0451
S.E. of regression    204.6601     Akaike info criterion     13.53985
Sum squared resid     1926746.     Schwarz criterion         13.65567
Log likelihood       -328.7263     Hannan-Quinn criter.      13.58379
F-statistic           84.45113     Durbin-Watson stat        2.116362
Prob(F-statistic)     0.000000
Table 5.5  Cochrane-Orcutt AR(1) Regression - Twostep

. prais c y, corc two

Iteration 0:  rho = 0.0000
Iteration 1:  rho = 0.9059

Cochrane-Orcutt AR(1) regression - twostep estimates

Source      |        SS      df          MS         Number of obs  =       48
------------+-------------------------------------  F(1, 46)       =   519.58
Model       |    17473195     1     17473195         Prob > F       =   0.0000
Residual    |  1546950.74    46    33629.364         R-squared      =   0.9187
------------+-------------------------------------  Adj R-squared  =   0.9169
Total       |  19020145.7    47   404683.951         Root MSE       =   183.38

c           |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
------------+----------------------------------------------------------------
y           |  .9892295   .0433981    22.79    0.000    .9018738    1.076585
_cons       | -1579.722   1014.436    -1.56    0.126   -3621.676    462.2328
------------+----------------------------------------------------------------
rho         |  .9059431

Durbin-Watson statistic (original)     0.180503
Durbin-Watson statistic (transformed)  2.457550

Table 5.6  The Iterative Prais-Winsten AR(1) Regression

. prais c y

Prais-Winsten AR(1) regression - iterated estimates

Source      |        SS      df          MS         Number of obs  =       49
------------+-------------------------------------  F(1, 47)       =   119.89
Model       |  3916565.48     1   3916565.48         Prob > F       =   0.0000
Residual    |  1535401.45    47   32668.1159         R-squared      =   0.7184
------------+-------------------------------------  Adj R-squared  =   0.7124
Total       |  5451966.93    48   113582.644         Root MSE       =   180.74

c           |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
------------+----------------------------------------------------------------
y           |   .912147    .047007    19.40    0.000    .8175811    1.006713
_cons       |  358.9638   1174.865     0.31    0.761    -2004.56    2722.488
------------+----------------------------------------------------------------
rho         |  .9808528

Durbin-Watson statistic (original)     0.180503
Durbin-Watson statistic (transformed)  2.314703
Table 5.7  The Newey-West HAC Standard Errors

Dependent Variable: CONSUM
Method: Least Squares
Sample: 1959 2007
Included observations: 49
Newey-West HAC Standard Errors & Covariance (lag truncation=3)

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C               -1343.314      422.2947      -3.180987     0.0026
Y                0.979228      0.022434      43.64969      0.0000

R-squared             0.993680     Mean dependent var        16749.10
Adjusted R-squared    0.993545     S.D. dependent var        5447.060
S.E. of regression    437.6277     Akaike info criterion     15.04057
Sum squared resid     9001348.     Schwarz criterion         15.11779
Log likelihood       -366.4941     Hannan-Quinn criter.      15.06987
F-statistic           7389.281     Durbin-Watson stat        0.180503
Prob(F-statistic)     0.000000

Notes 

1.  A  computational  warning  is  in  order  when  one  is  applying  the  Cochrane-Orcutt  transformation 
to  cross-section  data.  Time-series  data  has  a  natural  ordering  which  is  generally  lacking  in  cross- 
section  data.  Therefore,  one  should  be  careful  in  applying  the  Cochrane-Orcutt  transformation  to 
cross-section  data  since  it  is  not  invariant  to  the  ordering  of  the  observations. 

2. Another test for serial correlation can be obtained as a by-product of maximum likelihood
estimation. The maximum likelihood estimator of $\rho$ has a normal limiting distribution with mean $\rho$ and
variance $(1-\rho^2)/T$. Hence, one can compute $\hat\rho_{MLE}/[(1-\hat\rho^2_{MLE})/T]^{1/2}$ and compare it to critical
values from the normal distribution.
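
The statistic in Note 2 is a one-line computation once $\hat\rho_{MLE}$ is available. The sketch below is illustrative only: the function name is hypothetical, and the inputs in the example call are made-up numbers rather than estimates from the text.

```python
import math


def rho_z_statistic(rho_hat, T):
    """z = rho_hat / sqrt((1 - rho_hat^2) / T), compared with N(0, 1) critical values."""
    if not -1.0 < rho_hat < 1.0:
        raise ValueError("rho_hat must lie strictly between -1 and 1")
    return rho_hat / math.sqrt((1.0 - rho_hat ** 2) / T)


# Illustrative inputs only: rho_hat = 0.5 from a sample of T = 100 observations.
z = rho_z_statistic(0.5, 100)
print(f"z = {z:.3f}; reject rho = 0 at the 5% level (two-sided) if |z| > 1.96")
```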


Problems 

1. $s^2$ Is Biased Under Heteroskedasticity. For the simple linear regression with heteroskedasticity,
i.e., $E(u_i^2) = \sigma_i^2$, show that $E(s^2)$ is a function of the $\sigma_i^2$'s.


2. OLS Variance Is Biased Under Heteroskedasticity. For the simple linear regression with
heteroskedasticity of the form $E(u_i^2) = \sigma_i^2 = bx_i^2$ where $b > 0$, show that $E(s^2/\sum_{i=1}^n x_i^2)$ understates
the variance of $\hat\beta_{OLS}$ which is

$$\sum_{i=1}^n x_i^2 \sigma_i^2 \Big/ \Big(\sum_{i=1}^n x_i^2\Big)^2.$$
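3. Weighted Least Squares. This is based on Kmenta (1986).

(a) Solve the two equations in (5.11) and show that the solution is given by (5.12).

(b) Show that

$$\text{var}(\tilde\beta) = \frac{\sum_{i=1}^n (1/\sigma_i^2)}{\left[\sum_{i=1}^n X_i^2/\sigma_i^2\right]\left[\sum_{i=1}^n (1/\sigma_i^2)\right] - \left[\sum_{i=1}^n X_i/\sigma_i^2\right]^2}
= \frac{\sum_{i=1}^n w_i^*}{\left(\sum_{i=1}^n w_i^* X_i^2\right)\left(\sum_{i=1}^n w_i^*\right) - \left(\sum_{i=1}^n w_i^* X_i\right)^2}
= \frac{1}{\sum_{i=1}^n w_i^* (X_i - \bar X^*)^2}$$

where $w_i^* = (1/\sigma_i^2)$ and $\bar X^* = \sum_{i=1}^n w_i^* X_i / \sum_{i=1}^n w_i^*$.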


4. Relative Efficiency of OLS Under Heteroskedasticity. Consider the simple linear regression with
heteroskedasticity of the form $\sigma_i^2 = \sigma^2 X_i^\delta$ where $X_i = 1, 2, \ldots, 10$.

(a) Compute $\text{var}(\hat\beta_{OLS})$ for $\delta = 0.5, 1, 1.5$ and 2.

(b) Compute $\text{var}(\tilde\beta_{BLUE})$ for $\delta = 0.5, 1, 1.5$ and 2.

(c) Compute the efficiency of $\hat\beta_{OLS} = \text{var}(\tilde\beta_{BLUE})/\text{var}(\hat\beta_{OLS})$ for $\delta = 0.5, 1, 1.5$ and 2. What
happens to this efficiency measure as $\delta$ increases?


5. Consider the simple regression with only a constant $y_i = \alpha + u_i$ for $i = 1, 2, \ldots, n$; where the
$u_i$'s are independent with mean zero and $\text{var}(u_i) = \sigma_1^2$ for $i = 1, 2, \ldots, n_1$; and $\text{var}(u_i) = \sigma_2^2$ for
$i = n_1 + 1, \ldots, n_1 + n_2$ with $n = n_1 + n_2$.


(a) Derive the OLS estimator of $\alpha$ along with its mean and variance.

(b) Derive the GLS estimator of $\alpha$ along with its mean and variance.

(c) Obtain the relative efficiency of OLS with respect to GLS. Compute their relative efficiency
for various values of $\sigma_2^2/\sigma_1^2 = 0.2, 0.4, 0.6, 0.8, 1, 1.25, 1.33, 2.5, 5$; and $n_1/n = 0.2, 0.3, 0.4, \ldots,$
0.8. Plot this relative efficiency.

(d) Assume that $u_i$ is $N(0, \sigma_1^2)$ for $i = 1, 2, \ldots, n_1$; and $N(0, \sigma_2^2)$ for $i = n_1 + 1, \ldots, n_1 + n_2$; with
$u_i$'s being independent. What is the maximum likelihood estimator of $\alpha$, $\sigma_1^2$ and $\sigma_2^2$?

(e) Derive the LR test for testing $H_0$; $\sigma_1^2 = \sigma_2^2$ in part (d).

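6. Show that for an AR(1) model given in (5.26), $E(u_t u_s) = \rho^{|t-s|}\sigma_u^2$ for $t, s = 1, 2, \ldots, T$.

7. Relative Efficiency of OLS Under the AR(1) Model. This problem is based on Johnston (1984, pp.
310-312). For the simple regression without a constant $y_t = \beta x_t + u_t$ with $u_t = \rho u_{t-1} + \epsilon_t$ and
$\epsilon_t \sim \text{IID}(0, \sigma_\epsilon^2)$

(a) Show that

$$\text{var}(\hat\beta_{OLS}) = \frac{\sigma_u^2}{\sum_{t=1}^T x_t^2}\left[1 + 2\rho\,\frac{\sum_{t=1}^{T-1} x_t x_{t+1}}{\sum_{t=1}^T x_t^2} + 2\rho^2\,\frac{\sum_{t=1}^{T-2} x_t x_{t+2}}{\sum_{t=1}^T x_t^2} + \ldots + 2\rho^{T-1}\,\frac{x_1 x_T}{\sum_{t=1}^T x_t^2}\right]$$

and that the Prais-Winsten estimator $\hat\beta_{PW}$ has variance

$$\text{var}(\hat\beta_{PW}) = \frac{\sigma_u^2}{\sum_{t=1}^T x_t^2}\left[\frac{1 - \rho^2}{1 + \rho^2 - 2\rho \sum_{t=1}^{T-1} x_t x_{t+1} / \sum_{t=1}^T x_t^2}\right]$$

These expressions are easier to prove using matrix algebra, see Chapter 9.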
(b) Let $x_t$ itself follow an AR(1) scheme with parameter $\lambda$, i.e., $x_t = \lambda x_{t-1} + v_t$, and let $T \to \infty$.
Show that

$$\text{asy eff}(\hat\beta_{OLS}) = \lim_{T \to \infty} \frac{\text{var}(\hat\beta_{PW})}{\text{var}(\hat\beta_{OLS})}
= \frac{1 - \rho^2}{(1 + \rho^2 - 2\rho\lambda)(1 + 2\rho\lambda + 2\rho^2\lambda^2 + \ldots)}
= \frac{(1 - \rho^2)(1 - \rho\lambda)}{(1 + \rho^2 - 2\rho\lambda)(1 + \rho\lambda)}$$

(c) Tabulate this asy eff($\hat\beta_{OLS}$) for various values of $\rho$ and $\lambda$ where $\rho$ varies between $-0.9$ and
$+0.9$ in increments of 0.1, while $\lambda$ varies between 0 and 0.9 in increments of 0.1. What do you
conclude? How serious is the loss in efficiency in using OLS rather than the PW procedure?

(d) Ignoring this autocorrelation one would compute $\sigma_u^2 / \sum_{t=1}^T x_t^2$ as the $\text{var}(\hat\beta_{OLS})$. The
difference between this wrong formula and that derived in part (a) gives us the bias in estimating
the variance of $\hat\beta_{OLS}$. Show that as $T \to \infty$, this asymptotic proportionate bias is given by
$-2\rho\lambda/(1 + \rho\lambda)$. Tabulate this asymptotic bias for various values of $\rho$ and $\lambda$ as in part (c).
What do you conclude? How serious is the asymptotic bias of using the wrong variances for
$\hat\beta_{OLS}$ when the disturbances are first-order autocorrelated?

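(e) Show that

$$E(s^2) = \sigma_u^2 \left\{T - \left[1 + 2\rho\,\frac{\sum_{t=1}^{T-1} x_t x_{t+1}}{\sum_{t=1}^T x_t^2} + 2\rho^2\,\frac{\sum_{t=1}^{T-2} x_t x_{t+2}}{\sum_{t=1}^T x_t^2} + \ldots + 2\rho^{T-1}\,\frac{x_1 x_T}{\sum_{t=1}^T x_t^2}\right]\right\} \Big/ (T-1)$$

Conclude that if $\rho = 0$, then $E(s^2) = \sigma_u^2$. If $x_t$ follows an AR(1) scheme with parameter $\lambda$,
then for a large $T$, we get

$$E(s^2) = \sigma_u^2 \left(T - \frac{1 + \rho\lambda}{1 - \rho\lambda}\right) \Big/ (T-1)$$

Compute this $E(s^2)$ for $T = 101$ and various values of $\rho$ and $\lambda$ as in part (c). What do you
conclude? How serious is the bias in using $s^2$ as an unbiased estimator for $\sigma_u^2$?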


8. OLS Variance Is Biased Under Serial Correlation. For the AR(1) model given in (5.26), show that
if $\rho > 0$ and the $x_t$'s are positively autocorrelated that $E(s^2/\sum x_t^2)$ understates the $\text{var}(\hat\beta_{OLS})$
given in (5.34).

9. Show that for the AR(1) model, the Durbin-Watson statistic has $\text{plim}\, d = 2(1 - \rho)$.

10.  Regressions  with  Non-zero  Mean  Disturbances.  Consider  the  simple  regression  with  a  constant 


$$Y_i = \alpha + \beta X_i + u_i \qquad i = 1, 2, \ldots, n$$


where $\alpha$ and $\beta$ are scalars and $u_i$ is independent of the $X_i$'s. Show that:

(a) If the $u_i$'s are independent and identically gamma distributed with $f(u_i) = \frac{1}{\Gamma(\theta)} u_i^{\theta - 1} e^{-u_i}$
where $u_i > 0$ and $\theta > 0$, then $\hat\alpha_{OLS} - s^2$ is unbiased for $\alpha$.

(b) If the $u_i$'s are independent and identically $\chi^2$ distributed with $\nu$ degrees of freedom, then
$\hat\alpha_{OLS} - s^2/2$ is unbiased for $\alpha$.

(c) If the $u_i$'s are independent and identically exponentially distributed with $f(u_i) = \frac{1}{\theta} e^{-u_i/\theta}$
where $u_i > 0$ and $\theta > 0$, then $\hat\alpha_{OLS} - s$ is consistent for $\alpha$.
11. The Heteroskedastic Consequences of an Arbitrary Variance for the Initial Disturbance of an AR(1)
Model. This is based on Baltagi and Li (1990, 1992). Consider a simple AR(1) model

$$u_t = \rho u_{t-1} + \epsilon_t \qquad t = 1, 2, \ldots, T \qquad |\rho| < 1$$

with $\epsilon_t \sim \text{IID}(0, \sigma_\epsilon^2)$ independent of $u_0 \sim (0, \sigma_\epsilon^2/\tau)$, and $\tau$ is an arbitrary positive parameter.

(a) Show that this arbitrary variance on the initial disturbance $u_0$ renders the disturbances, in
general, heteroskedastic.

(b) Show that $\text{var}(u_t) = \sigma_t^2$ is increasing if $\tau > (1 - \rho^2)$ and decreasing if $\tau < (1 - \rho^2)$. When is
the process homoskedastic?

(c) Show that $\text{cov}(u_t, u_{t-s}) = \rho^s \sigma_{t-s}^2$ for $t > s$. Hint: See the solution by Kim (1991).

(d) Consider the simple regression model

$$y_t = \beta x_t + u_t \qquad t = 1, 2, \ldots, T$$

with $u_t$ following the AR(1) process described above. Consider the common case where
$\rho > 0$ and the $x_t$'s are positively autocorrelated. For this case, it is a standard result that
the $\text{var}(\hat\beta_{OLS})$ is understated under the stationary case (i.e., $(1 - \rho^2) = \tau$), see problem 8.
This means that OLS rejects too often the hypothesis $H_0$; $\beta = 0$. Show that OLS will reject
more often than the stationary case if $\tau < 1 - \rho^2$ and less often than the stationary case if
$\tau > (1 - \rho^2)$. Hint: See the solution by Koning (1992).

12. ML Estimation of Linear Regression Model with AR(1) Errors and Two Observations. This is
based on Magee (1993). Consider the regression model $y_i = x_i\beta + u_i$, with only two observations
$i = 1, 2$, and the nonstochastic $|x_1| \neq |x_2|$ are scalars. Assume that $u_1 \sim N(0, \sigma^2)$ and $u_2 = \rho u_1 + \epsilon$
with $|\rho| < 1$. Also, $\epsilon \sim N[0, (1 - \rho^2)\sigma^2]$ where $\epsilon$ and $u_1$ are independent.

(a) Show that the OLS estimator of $\beta$ is $(x_1 y_1 + x_2 y_2)/(x_1^2 + x_2^2)$.

(b) Show that the ML estimator of $\beta$ is $(x_1 y_1 - x_2 y_2)/(x_1^2 - x_2^2)$.

(c) Show that the ML estimator of $\rho$ is $2x_1 x_2/(x_1^2 + x_2^2)$ and thus is nonstochastic.

(d) How do the ML estimates of $\beta$ and $\rho$ behave as $x_1 \to x_2$ and $x_1 \to -x_2$? Assume $x_2 \neq 0$.
Hint: See the solution by Baltagi and Li (1995).

13. For the empirical example in section 5.5 based on the Cigarette Consumption Data in Table 3.2.

(a) Replicate the OLS regression of $\log C$ on $\log P$, $\log Y$ and a constant. Plot the residuals versus
$\log Y$ and verify Figure 5.1.

(b) Run Glejser's (1969) test by regressing $|e_i|$, the absolute value of the residuals from part (a),
on $(\log Y_i)^\delta$ for $\delta = 1, -1, -0.5$ and 0.5. Verify the t-statistics reported in the text.

(c) Run Goldfeld and Quandt's (1965) test by ordering the observations according to $\log Y_i$ and
omitting 12 central observations. Report the two regressions based on the first and last 17
observations and verify the F-test reported in the text.

(d) Verify the Spearman rank correlation test based on the rank($\log Y_i$) and rank $|e_i|$.

(e) Verify Harvey's (1976) multiplicative heteroskedasticity test based on regressing $\log e_i^2$ on
$\log(\log Y_i)$.

(f) Run the Breusch and Pagan (1979) test based on the regression of $e_i^2/\hat\sigma^2$ on $\log Y_i$, where
$\hat\sigma^2 = \sum_{i=1}^{46} e_i^2/46$.

(g) Run White's (1980) test for heteroskedasticity.

(h) Run the Jarque and Bera (1987) test for normality of the disturbances.

(i) Compute White's (1980) heteroskedasticity robust standard errors for the regression in part (a).
14. A Simple Linear Trend Model with AR(1) Disturbances. This is based on Kramer (1982).

(a) Consider the following simple linear trend model

$$Y_t = \alpha + \beta t + u_t$$

where $u_t = \rho u_{t-1} + \epsilon_t$ with $|\rho| < 1$, $\epsilon_t \sim \text{IID}(0, \sigma_\epsilon^2)$ and $\text{var}(u_t) = \sigma_u^2 = \sigma_\epsilon^2/(1 - \rho^2)$. Our
interest is focused on the estimates of the trend coefficient, $\beta$, and the estimators to be
considered are OLS, CO (assuming that the true value of $\rho$ is known), the first-difference
estimator (FD), and the Generalized Least Squares (GLS), which is Best Linear Unbiased
(BLUE) in this case.

In the context of the simple linear trend model, the formulas for the variances of these
estimators reduce to

$V(OLS) = 12\sigma^2\{-6\rho^{T+1}[(T-1)\rho - (T+1)]^2 - (T^3 - T)\rho^4$
$\qquad + 2(T^2-1)(T-3)\rho^3 + 12(T^2+1)\rho^2 - 2(T^2-1)(T+3)\rho$
$\qquad + (T^3 - T)\}/(1-\rho^2)(1-\rho)^4(T^3-T)^2$

$V(CO) = 12\sigma^2/(1-\rho)^2(T^3 - 3T^2 + 2T),$

$V(FD) = 2\sigma^2(1 - \rho^{T-1})/(1-\rho^2)(T-1)^2,$

$V(GLS) = 12\sigma^2/(T-1)[(T-3)(T-2)\rho^2 - 2(T-3)(T-1)\rho + T(T+1)].$

(b) Compute these variances and their relative efficiency with respect to the GLS estimator for
$T = 10, 20, 30, 40$ and $\rho$ between $-0.9$ and 0.9 in 0.1 increments.

(c) For a given $T$, show that the limit of var(OLS)/var(CO) is zero as $\rho \to 1$. Prove that
var(FD) and var(GLS) both tend in the limit to $\sigma^2/(T-1) < \infty$ as $\rho \to 1$. Conclude
that var(GLS)/var(FD) tends to 1 as $\rho \to 1$. Also, show that
$\lim_{\rho \to 1}[\text{var}(GLS)/\text{var}(OLS)] = 5(T^2 + T)/6(T^2 + 1) < 1$ provided $T > 3$.

(d) For a given $\rho$, show that var(FD) $= O(T^{-2})$ whereas the variance of the remaining estimators
is $O(T^{-3})$. Conclude that $\lim_{T \to \infty}[\text{var}(FD)/\text{var}(CO)] = \infty$ for any given $\rho$.

15. Consider the empirical example in section 5.6, based on the Consumption-Income data in Table 5.3.
Obtain this data set from the CONSUMP.DAT file on the Springer web site.

(a) Replicate the OLS regression of $C_t$ on $Y_t$ and a constant, and compute the Durbin-Watson
statistic. Test $H_0$; $\rho = 0$ versus $H_1$; $\rho > 0$ at the 5% significance level.

(b)  Test  for  first-order  serial  correlation  using  the  Breusch  and  Godfrey  test. 

(c)  Perform  the  two-step  Cochrane-Orcutt  procedure  and  verify  Table  5.5.  What  happens  if  we 
iterate  the  Cochrane-Orcutt  procedure? 

(d)  Perform  the  Prais-Winsten  procedure  and  verify  Table  5.6. 

(e)  Compute  the  Newey-West  heteroskedasticity  and  autocorrelation-consistent  standard  errors 
for  the  least  squares  estimates  in  part  (a). 

16. Benderly and Zwick (1985) considered the following equation

$$RS_t = \alpha + \beta Q_{t+1} + \gamma P_t + u_t$$

where $RS_t$ = the real return on stocks in year $t$, $Q_{t+1}$ = the annual rate of growth of real GNP
in year $t + 1$, and $P_t$ = the rate of inflation in year $t$. The data is provided on the Springer web
site and labeled BENDERLY.ASC. This data covers 31 annual observations for the U.S. over the
period 1952-1982. This was obtained from Lott and Ray (1991). This equation is used to test the
significance of the inflation rate in explaining real stock returns. Use the sample period 1954-1976
to answer the following questions:

(a) Run OLS to estimate the above equation. Remember to use $Q_{t+1}$. Is $P_t$ significant in this
equation? Plot the residuals against time. Compute the Newey-West heteroskedasticity and
autocorrelation-consistent standard errors for these least squares estimates.

(b) Test for serial correlation using the D.W. test.

(c) Would your decision in (b) change if you used the Breusch-Godfrey test for first-order serial
correlation?

(d) Run the Cochrane-Orcutt procedure to correct for first-order serial correlation. Report your
estimate of $\rho$.

(e) Run a Prais-Winsten procedure accounting for the first observation and report your estimate
of $\rho$. Plot the residuals against time.

17. Using our cross-section Energy/GDP data set in Chapter 3, problem 3.16 consider the following
two models:

Model 1: $\log En = \alpha + \beta \log RGDP + u$
Model 2: $En = \alpha + \beta RGDP + v$

Make sure you have corrected the W. Germany observation on EN as described in problem 3.16
part (d).

(a) Run OLS on both Models 1 and 2. Test for heteroskedasticity using the Goldfeld/Quandt
Test. Omit $c = 6$ central observations. Why is heteroskedasticity a problem in Model 2, but
not Model 1?

(b)  For  Model  2,  test  for  heteroskedasticity  using  the  Glejser  Test. 

(c)  Now  use  the  Breusch-Pagan  Test  to  test  for  heteroskedasticity  on  Model  2. 

(d)  Apply  White’s  Test  to  Model  2. 

(e)  Do  all  these  tests  give  the  same  decision? 

(f) Propose and estimate a simple transformation of Model 2, assuming heteroskedasticity of
the form $\sigma_i^2 = \sigma^2 RGDP^2$.

(g) Propose and estimate a simple transformation of Model 2, assuming heteroskedasticity of
the form $\sigma_i^2 = \sigma^2 (a + b\,RGDP)^2$.

(h) Now suppose that heteroskedasticity is of the form $\sigma_i^2 = \sigma^2 RGDP^\gamma$ where $\gamma$ is an unknown
parameter. Propose and estimate a simple transformation for Model 2. Hint: You can write
$\sigma_i^2$ as $\exp\{\alpha + \gamma \log RGDP\}$ where $\alpha = \log \sigma^2$.

(i)  Compare  the  standard  errors  of  the  estimates  for  Model  2  from  OLS,  also  obtain  White’s 
heteroskedasticity-consistent  standard  errors.  Compare  them  with  the  simple  Weighted  Least 
Squares  estimates  of  the  standard  errors  in  parts  (f),  (g)  and  (h).  What  do  you  conclude? 

18. You are given quarterly data from the first quarter of 1965 (1965.1) to the fourth quarter of 1983
(1983.4) on employment in Orange County California (EMP) and real gross national product
(RGNP). The data set is in a file called ORANGE.DAT on the Springer web site.

(a) Generate the lagged variable of real GNP, call it $RGNP_{t-1}$, and estimate the following model
by OLS: $EMP_t = \alpha + \beta RGNP_{t-1} + u_t$.

(b) What does inspection of the residuals and the Durbin-Watson statistic suggest?

(c) Assuming $u_t = \rho u_{t-1} + \epsilon_t$ where $|\rho| < 1$ and $\epsilon_t \sim \text{IIN}(0, \sigma_\epsilon^2)$, use the Cochrane-Orcutt
procedure to estimate $\rho$, $\alpha$ and $\beta$. Compare the latter estimates and their standard errors
with those of OLS.

(d) The Cochrane-Orcutt procedure omits the first observation. Perform the Prais-Winsten
adjustment. Compare the resulting estimates and standard error with those in part (c).

(e) Apply the Breusch-Godfrey test for first and second order autoregression. What do you
conclude?

(f) Compute the Newey-West heteroskedasticity and autocorrelation-consistent covariance
standard errors for the least squares estimates in part (a).

19.  Consider  the  earning  data  underlying  the  regression  in  Table  4.1  and  available  on  the  Springer 
web  site  as  EARN.ASC. 

(a)  Apply  White’s  test  for  heteroskedasticity  to  the  regression  residuals. 

(b)  Compute  White’s  heteroskedasticity-consistent  standard  errors. 

(c)  Test  the  least  squares  residuals  for  normality  using  the  Jarque-Bera  test. 

20. Hedonic Housing. Harrison and Rubinfeld (1978) collected data on 506 census tracts in the Boston
area  in  1970  to  study  hedonic  housing  prices  and  the  willingness  to  pay  for  clean  air.  This 
data  is  available  on  the  Springer  web  site  as  HEDONIC.XLS.  The  dependent  variable  is  the 
Median  Value  (MV)  of  owner-occupied  homes.  The  regressors  include  two  structural  variables, 
RM  the  average  number  of  rooms,  and  AGE  representing  the  proportion  of  owner  units  built  prior 
to  1940.  In  addition  there  are  eight  neighborhood  variables:  B,  the  proportion  of  blacks  in  the 
population;  LSTAT,  the  proportion  of  population  that  is  lower  status;  CRIM,  the  crime  rate;  ZN, 
the  proportion  of  25000  square  feet  residential  lots;  INDUS,  the  proportion  of  nonretail  business 
acres;  TAX,  the  full  value  property  tax  rate  ($/$10000);  PTRATIO,  the  pupil-teacher  ratio;  and 
CHAS  represents  the  dummy  variable  for  Charles  River:  =  1  if  a  tract  bounds  the  Charles.  There 
are  also  two  accessibility  variables,  DIS  the  weighted  distances  to  five  employment  centers  in  the 
Boston  region,  and  RAD  the  index  of  accessibility  to  radial  highways.  One  more  regressor  is  an 
air  pollution  variable  NOX,  the  annual  average  nitrogen  oxide  concentration  in  parts  per  hundred 
million. 

(a)  Run  OLS  of  MV  on  the  13  independent  variables  and  a  constant.  Plot  the  residuals. 

(b)  Apply  White’s  tests  for  heteroskedasticity. 

(c)  Obtain  the  White  heteroskedasticity-consistent  standard  errors. 

(d)  Test  the  least  squares  residuals  for  normality  using  the  Jarque-Bera  test. 

21. Agglomeration Economies, Diseconomies, and Growth. Wheeler (2003) uses data on 3106 counties
of the contiguous USA to fit a fourth-order polynomial relating county population (employment)
growth (over the period 1980 to 1990) to log(size), where size is measured as total
resident population or total civilian employment. Other control variables include the proportion
of the adult resident population (i.e. of age 25 or older) with a bachelor's degree or more; the
proportion of total employment in manufacturing; and the unemployment rate, all for the year
1980; per capita income in 1979; the proportion of the resident population belonging to nonwhite
racial categories in 1980; and the share of local government expenditures going to each of
three public goods (education, roads and highways, police protection) in 1982. This data can be
downloaded from the JAE archive data web site.

(a)  Replicate  the  OLS  regressions  reported  in  Tables  VIII  and  IX  of  Wheeler  (2003,  pp.  88-89). 

(b)  Apply  White’s  and  Breusch-Pagan  tests  for  heteroskedasticity. 

(c)  Test  the  least  squares  residuals  for  normality  using  the  Jarque-Bera  test. 

CHAPTER 6


Distributed  Lags  and  Dynamic  Models 


6.1  Introduction 

Many  economic  models  have  lagged  values  of  the  regressors  in  the  regression  equation.  For 
example,  it  takes  time  to  build  roads  and  highways.  Therefore,  the  effect  of  this  public  investment 
on  growth  in  GNP  will  show  up  with  a  lag,  and  this  effect  will  probably  linger  on  for  several 
years.  It  takes  time  before  investment  in  research  and  development  pays  off  in  new  inventions 
which  in  turn  take  time  to  develop  into  commercial  products.  In  studying  consumption  behavior, 
a  change  in  income  may  affect  consumption  over  several  periods.  This  is  true  in  the  permanent 
income  theory  of  consumption,  where  it  may  take  the  consumer  several  periods  to  determine 
whether  the  change  in  real  disposable  income  was  temporary  or  permanent.  For  example,  is 
the  extra  consulting  money  earned  this  year  going  to  continue  next  year?  Also,  lagged  values 
of  real  disposable  income  appear  in  the  regression  equation  because  the  consumer  takes  into 
account  his  life  time  earnings  in  trying  to  smooth  out  his  consumption  behavior.  In  turn,  one’s 
life  time  income  may  be  guessed  by  looking  at  past  as  well  as  current  earnings.  In  other  words, 
the  regression  relationship  would  look  like 


$$Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \dots + \beta_s X_{t-s} + u_t \qquad t = 1, 2, \dots, T \tag{6.1}$$

where $Y_t$ denotes the t-th observation on the dependent variable Y and $X_{t-s}$ denotes the (t-s)th
observation on the independent variable X. $\alpha$ is the intercept and $\beta_0, \beta_1, \dots, \beta_s$ are the current
and lagged coefficients of $X_t$. Equation (6.1) is known as a distributed lag since it distributes the
effect of an increase in income on consumption over s periods. Note that the short-run effect of
a unit change in X on Y is given by $\beta_0$, while the long-run effect of a unit change in X on Y
is $(\beta_0 + \beta_1 + \dots + \beta_s)$.

Suppose that you observe $X_t$ from 1959 to 2007. $X_{t-1}$ is the same variable but for the previous
period, i.e., 1958-2006. Since 1958 is not available in this data, the software you are using will
start from 1959 for $X_{t-1}$, and end at 2006. This means that when we lag once, the current
$X_t$ series will have to start at 1960 and end at 2007. For practical purposes, this means that
when we lag once we lose one observation from the sample. So if we lag s periods, we lose
s observations. Furthermore, we are estimating one extra $\beta$ with every lag. Therefore, there
is double jeopardy with respect to loss of degrees of freedom. The number of observations falls
(because we are lagging the same series), and the number of parameters to be estimated increases
with every lagged variable introduced. Besides the loss of degrees of freedom, the regressors
in (6.1) are likely to be highly correlated with each other. In fact most economic time series
are usually trended and very highly correlated with their lagged values. This introduces the
problem of multicollinearity among the regressors and as we saw in Chapter 4, the higher the
multicollinearity among these regressors, the lower is the reliability of the regression estimates.
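
The "double jeopardy" can be made concrete with a small count. The sketch below is illustrative only: the function name `degrees_of_freedom` is hypothetical, and it assumes the regression in (6.1) contains an intercept plus the current and s lagged values of one regressor.

```python
def degrees_of_freedom(T, s):
    """Degrees of freedom left in (6.1) with T raw observations and s lags."""
    n_obs = T - s        # lagging s periods drops the first s observations
    n_params = s + 2     # intercept plus beta_0, beta_1, ..., beta_s
    return n_obs - n_params

# Annual data for 1959-2007 give T = 49 raw observations
years = list(range(1959, 2008))
for s in (0, 1, 5):
    print(s, len(years) - s, degrees_of_freedom(len(years), s))
```

Each extra lag costs two degrees of freedom: one observation is lost and one more coefficient is estimated.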

In this model, OLS is still BLUE because the classical assumptions are still satisfied. All we
have done in (6.1) is introduce the additional regressors $(X_{t-1}, \dots, X_{t-s})$. These regressors are
uncorrelated with the disturbances since they are lagged values of $X_t$, which are by assumption
not correlated with $u_t$ for every t.

In order to reduce the degrees of freedom problem, one could impose more structure on the $\beta$'s.
One of the simplest forms imposed on these coefficients is the linear arithmetic lag (see Figure
6.1), which can be written as

$$\beta_i = [(s + 1) - i]\beta \qquad \text{for } i = 0, 1, \dots, s \tag{6.2}$$

The lagged coefficients of X follow a linear distributed lag declining arithmetically from $(s + 1)\beta$
for $X_t$ to $\beta$ for $X_{t-s}$. Substituting (6.2) in (6.1) one gets

$$Y_t = \alpha + \sum_{i=0}^{s} \beta_i X_{t-i} + u_t = \alpha + \beta \sum_{i=0}^{s} [(s + 1) - i] X_{t-i} + u_t \tag{6.3}$$

where the latter equation can be estimated by the regression of $Y_t$ on a constant and $Z_t$, where

$$Z_t = \sum_{i=0}^{s} [(s + 1) - i] X_{t-i}$$

This $Z_t$ can be calculated given s and $X_t$. Hence, we have reduced the estimation of $\beta_0, \beta_1, \dots, \beta_s$
into the estimation of just one $\beta$. Once $\hat{\beta}$ is obtained, $\hat{\beta}_i$ can be deduced from (6.2), for i =
0, 1, ..., s. Despite its simplicity, this lag is too restrictive to impose on the regression and is
not usually used in practice.
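
As a minimal sketch of the construction of $Z_t$ and the regression in (6.3), the following uses NumPy on simulated data. The function name `arithmetic_lag_regressor`, the simulated series and the parameter values are assumptions made for illustration; the data are generated without a disturbance so that OLS recovers $\beta$ exactly.

```python
import numpy as np

def arithmetic_lag_regressor(x, s):
    """Return Z_t = sum_{i=0}^{s} [(s+1) - i] * X_{t-i} for t = s, ..., T-1."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    z = np.zeros(T - s)
    for i in range(s + 1):
        # x[s - i : T - i] holds X_{t-i} for t = s, ..., T-1
        z += (s + 1 - i) * x[s - i : T - i]
    return z

# Simulated data satisfying the arithmetic lag exactly (hypothetical values)
rng = np.random.default_rng(0)
T, s, alpha, beta = 60, 5, 1.0, 0.05
x = rng.normal(size=T).cumsum()
z = arithmetic_lag_regressor(x, s)
y = alpha + beta * z                      # no disturbance term

# Regress Y_t on a constant and Z_t; the first s observations are lost
design = np.column_stack([np.ones_like(z), z])
alpha_hat, beta_hat = np.linalg.lstsq(design, y, rcond=None)[0]

# Implied lag coefficients beta_i = [(s+1) - i] * beta, i = 0, ..., s
beta_i = [(s + 1 - i) * beta_hat for i in range(s + 1)]
```

With s = 5 the weights are 6, 5, 4, 3, 2, 1, which is the same combination formed by the `gen z=...` command in Table 6.1.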

Alternatively, one can think of $\beta_i = f(i)$ for i = 0, 1, ..., s. If f(i) is a continuous function,
over a closed interval, then it can be approximated by an r-th degree polynomial,

$$f(i) = a_0 + a_1 i + \dots + a_r i^r$$

For example, if r = 2, then

$$\beta_i = a_0 + a_1 i + a_2 i^2 \qquad \text{for } i = 0, 1, 2, \dots, s$$

so that

$$\begin{aligned}
\beta_0 &= a_0 \\
\beta_1 &= a_0 + a_1 + a_2 \\
\beta_2 &= a_0 + 2a_1 + 4a_2 \\
&\;\;\vdots \\
\beta_s &= a_0 + s a_1 + s^2 a_2
\end{aligned}$$

Once $a_0$, $a_1$, and $a_2$ are estimated, $\beta_0, \beta_1, \dots, \beta_s$ can be deduced. In fact, substituting $\beta_i =
a_0 + a_1 i + a_2 i^2$ in (6.1) we get

$$\begin{aligned}
Y_t &= \alpha + \sum_{i=0}^{s} (a_0 + a_1 i + a_2 i^2) X_{t-i} + u_t \\
    &= \alpha + a_0 \sum_{i=0}^{s} X_{t-i} + a_1 \sum_{i=0}^{s} i X_{t-i} + a_2 \sum_{i=0}^{s} i^2 X_{t-i} + u_t
\end{aligned} \tag{6.4}$$

This last equation shows that $\alpha$, $a_0$, $a_1$ and $a_2$ can be estimated from the regression of $Y_t$
on a constant, $Z_0 = \sum_{i=0}^{s} X_{t-i}$, $Z_1 = \sum_{i=0}^{s} i X_{t-i}$ and $Z_2 = \sum_{i=0}^{s} i^2 X_{t-i}$. This procedure was
proposed by Almon (1965) and is known as the Almon lag. One of the problems with this
procedure is the choice of s and r, the number of lags on $X_t$, and the degree of the polynomial,
respectively. In practice, neither is known. Davidson and MacKinnon (1993) suggest starting
with a maximum reasonable lag s* that is consistent with the theory and then based on the
unrestricted regression, given in (6.1), checking whether the fit of the model deteriorates as s*
is reduced. Some criteria suggested for this choice include: (i) maximizing $\bar{R}^2$; (ii) minimizing
Akaike's (1973) Information Criterion (AIC) with respect to s. This is given by $AIC(s) =
(RSS/T)e^{2s/T}$; or (iii) minimizing Schwarz (1978) Bayesian Information Criterion (BIC) with
respect to s. This is given by $BIC(s) = (RSS/T)T^{s/T}$ where RSS denotes the residual sum
of squares. Note that the AIC and BIC criteria, like $\bar{R}^2$, reward good fit but penalize loss
of degrees of freedom associated with a high value of s. These criteria are printed by most
regression software including SHAZAM, EViews and SAS. Once the lag length s is chosen it is
straightforward to determine r, the degree of the polynomial. Start with a high value of r and
construct the Z variables as described in (6.4). If r = 4 is the highest degree polynomial chosen
and $a_4$, the coefficient of $Z_4 = \sum_{i=0}^{s} i^4 X_{t-i}$, is insignificant, drop $Z_4$ and run the regression for
r = 3. Stop, if the coefficient of $Z_3$ is significant, otherwise drop $Z_3$ and run the regression for
r = 2.
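
The Almon construction can be sketched in a few lines of NumPy. Everything named here (`almon_regressors`, `aic`, `bic`, the simulated series and the values in `a_true`) is an illustrative assumption rather than code from the text; the data are generated without a disturbance so that the regression on $Z_0$, $Z_1$, $Z_2$ recovers the polynomial coefficients exactly. The two criterion functions simply transcribe the AIC(s) and BIC(s) formulas given above.

```python
import math
import numpy as np

def almon_regressors(x, s, r):
    """Return the (T - s) x (r + 1) matrix with columns Z_j = sum_i i**j * X_{t-i}."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    # lags[:, i] holds X_{t-i} for t = s, ..., T-1 (the first s observations are lost)
    lags = np.column_stack([x[s - i : T - i] for i in range(s + 1)])
    # powers[i, j] = i**j, so lags @ powers stacks Z_0, Z_1, ..., Z_r
    powers = np.array([[float(i) ** j for j in range(r + 1)] for i in range(s + 1)])
    return lags @ powers

def aic(rss, T, s):
    """AIC(s) = (RSS/T) * exp(2s/T), as defined in the text."""
    return (rss / T) * math.exp(2.0 * s / T)

def bic(rss, T, s):
    """BIC(s) = (RSS/T) * T**(s/T), as defined in the text."""
    return (rss / T) * T ** (s / T)

# Simulated example with hypothetical polynomial coefficients a_0, a_1, a_2
rng = np.random.default_rng(1)
T, s, r = 80, 5, 2
a_true = np.array([0.5, 0.2, -0.05])
x = rng.normal(size=T)
Z = almon_regressors(x, s, r)
y = 2.0 + Z @ a_true                       # no disturbance term

# Regress Y_t on a constant and Z_0, Z_1, Z_2
design = np.column_stack([np.ones(len(Z)), Z])
coef = np.linalg.lstsq(design, y, rcond=None)[0]
a_hat = coef[1:]

# Deduce beta_i = a_0 + a_1 * i + a_2 * i**2 for i = 0, ..., s
beta_hat = np.array([sum(a_hat[j] * i ** j for j in range(r + 1)) for i in range(s + 1)])
```

Six lag coefficients are obtained from only three estimated a's, which is the saving in degrees of freedom that motivates the Almon lag.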

Applied researchers usually impose end point constraints on this Almon lag. A near end
point constraint means that $\beta_{-1} = 0$ in equation (6.1). This means that for equation (6.4),
this constraint yields the following restriction on the second degree polynomial in a's: $\beta_{-1} =
f(-1) = a_0 - a_1 + a_2 = 0$. This restriction allows us to solve for $a_0$ given $a_1$ and $a_2$. In fact,
substituting $a_0 = a_1 - a_2$ into (6.4), the regression becomes

$$Y_t = \alpha + a_1(Z_1 + Z_0) + a_2(Z_2 - Z_0) + u_t \tag{6.5}$$

and once $a_1$ and $a_2$ are estimated, $a_0$ is deduced, and hence the $\beta_i$'s. This restriction essentially
states that $X_{t+1}$ has no effect on $Y_t$. This may not be a plausible assumption, especially in our
consumption example, where income next year enters the calculation of permanent income or
life time earnings. A more plausible assumption is the far end point constraint, where $\beta_{s+1} = 0$.
This means that $X_{t-(s+1)}$ does not affect $Y_t$. The further you go back in time, the less is the
effect on the current period. All we have to be sure of is that we have gone far back enough
to reach an insignificant effect. This far end point constraint is imposed by removing $X_{t-(s+1)}$
from the equation as we have done above. But, some researchers impose this restriction on
$\beta_i = f(i)$, i.e., by restricting $\beta_{s+1} = f(s + 1) = 0$. This yields for r = 2 the following constraint:
$a_0 + (s + 1)a_1 + (s + 1)^2 a_2 = 0$. Solving for $a_0$ and substituting in (6.4), the constrained regression
becomes

$$Y_t = \alpha + a_1[Z_1 - (s + 1)Z_0] + a_2[Z_2 - (s + 1)^2 Z_0] + u_t \tag{6.6}$$

One can also impose both end point constraints and reduce the regression into the estimation
of one a rather than three a's. Note that $\beta_{-1} = \beta_{s+1} = 0$ can be imposed by not including
$X_{t+1}$ and $X_{t-(s+1)}$ in the regression relationship. However, these end point restrictions impose
the additional restrictions that the polynomial on which the $\beta$'s lie should pass through zero at
i = -1 and i = (s + 1), see Figure 6.2.

[Figure 6.2 A Polynomial Lag with End Point Constraints]
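
The constrained regressions (6.5) and (6.6) only require recombining the Z variables. The sketch below is a hedged illustration on simulated, disturbance-free data; the helper `almon_z`, the coefficient values and the random series are all assumptions chosen so that each constraint holds exactly in the generated data.

```python
import numpy as np

def almon_z(x, s):
    """Return Z_0, Z_1, Z_2 (as columns) for a second degree Almon lag of length s."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    lags = np.column_stack([x[s - i : T - i] for i in range(s + 1)])
    i = np.arange(s + 1, dtype=float)
    return lags @ np.column_stack([np.ones(s + 1), i, i ** 2])

rng = np.random.default_rng(2)
T, s = 80, 5
x = rng.normal(size=T)
Z0, Z1, Z2 = almon_z(x, s).T
ones = np.ones(T - s)

# Near end point constraint: a_0 = a_1 - a_2, so f(-1) = 0
a1, a2 = 0.3, -0.04                         # hypothetical values
y_near = 1.0 + (a1 - a2) * Z0 + a1 * Z1 + a2 * Z2
near = np.column_stack([ones, Z1 + Z0, Z2 - Z0])            # regressors of (6.5)
_, a1_near, a2_near = np.linalg.lstsq(near, y_near, rcond=None)[0]
a0_near = a1_near - a2_near
f_minus_1 = a0_near - a1_near + a2_near     # implied beta_{-1}

# Far end point constraint: a_0 = -(s+1) a_1 - (s+1)^2 a_2, so f(s+1) = 0
b1, b2 = -0.3, 0.03                         # hypothetical values
b0 = -(s + 1) * b1 - (s + 1) ** 2 * b2
y_far = 1.0 + b0 * Z0 + b1 * Z1 + b2 * Z2
far = np.column_stack([ones, Z1 - (s + 1) * Z0, Z2 - (s + 1) ** 2 * Z0])  # regressors of (6.6)
_, b1_far, b2_far = np.linalg.lstsq(far, y_far, rcond=None)[0]
b0_far = -(s + 1) * b1_far - (s + 1) ** 2 * b2_far
f_s_plus_1 = b0_far + b1_far * (s + 1) + b2_far * (s + 1) ** 2   # implied beta_{s+1}
```

In each case only two a's are estimated and the third is deduced from the constraint, so the fitted polynomial passes through zero at the chosen end point by construction.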

These  additional  restrictions  on  the  polynomial  may  not  necessarily  be  true.  In  other  words, 
the  polynomial  could  intersect  the  X-axis  at  points  other  than  —1  or  (s  +  1).  Imposing  a 
restriction,  whether  true  or  not,  reduces  the  variance  of  the  estimates,  and  introduces  bias  if  the 
restriction  is  untrue.  This  is  intuitive,  because  this  restriction  gives  additional  information  which 
should  increase  the  reliability  of  the  estimates.  The  reduction  in  variance  and  the  introduction  of 
bias  naturally  lead  to  Mean  Square  Error  criteria  that  help  determine  whether  these  restrictions 
should  be  imposed,  see  Wallace  (1972).  These  criteria  are  beyond  the  scope  of  this  chapter.  In 
general,  one  should  be  careful  in  the  use  of  restrictions  that  may  not  be  plausible  or  even  valid.  In 
fact,  one  should  always  test  these  restrictions  before  using  them.  See  Schmidt  and  Waud  (1975). 

Empirical  Example:  Using  the  Consumption-Income  data  from  the  Economic  Report  of  the 
President  over  the  period  1959-2007,  given  in  Table  5.3,  we  estimate  a  consumption-income 
regression imposing a five year lag on income. In this case, all variables are in logs and s = 5
in  equation  (6.1).  Table  6.1  gives  the  Stata  output  imposing  the  linear  arithmetic  lag  given  in 
equation  (6.2). 

The regression output reports $\hat{\beta} = 0.0498$ which is statistically significant with a t-value
of 64.4. One can test the arithmetic lag restrictions jointly using an F-test. The Unrestricted
Residual Sum of Squares (URSS) is obtained by regressing $C_t$ on $Y_t, Y_{t-1}, \dots, Y_{t-5}$ and a
constant. This yields URSS = 0.016924. The RRSS is given in Table 6.1 as 0.036136 and it involves
imposing 5 restrictions given in (6.2). Therefore,

$$F = \frac{(0.036136249 - 0.016924337)/5}{0.016924337/37} = 8.40$$

and this is distributed as $F_{5,37}$ under the null hypothesis. This rejects the linear arithmetic lag
restrictions.

Table 6.1 Regression with Arithmetic Lag Restriction

. tsset year
        time variable:  year, 1959 to 2007
                delta:  1 unit

. gen ly=ln(y)
. gen lc=ln(c)
. gen z=6*ly+5*l.ly+4*l2.ly+3*l3.ly+2*l4.ly+l5.ly
(5 missing values generated)

. reg lc z

      Source |         SS    df          MS        Number of obs =      44
-------------+---------------------------------    F(1, 42)      = 4149.96
       Model |  3.5705689     1   3.5705689        Prob > F      =  0.0000
    Residual | .036136249    42  .000860387        R-squared     =  0.9900
-------------+---------------------------------    Adj R-squared =  0.9897
       Total | 3.60670515    43  .083876864        Root MSE      =  .02933

          lc |      Coef.   Std. Err.       t   P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------
           z |    .049768    .0007726   64.42   0.000    .0482089    .0513271
       _cons |  -.5086255    .1591019   -3.20   0.003   -.8297061   -.1875449

Table 6.2 Almon Polynomial, r = 2, s = 5 and Near End-Point Constraint

Dependent Variable = LNC
Method: Least Squares
Sample (adjusted): 1964 2007
Included observations: 44 after adjustments

Variable     Coefficient   Std. Error   t-Statistic    Prob.
C              -0.770611     0.201648     -3.821563   0.0004
PDL01           0.342152     0.056727      6.031589   0.0000
PDL02          -0.067215     0.012960     -5.186494   0.0000

R-squared             0.990054    Mean dependent var       9.736786
Adjusted R-squared    0.989568    S.D. dependent var       0.289615
S.E. of regression    0.029580    Akaike info criterion   -4.137705
Sum squared resid     0.035874    Schwarz criterion       -4.016055
Log likelihood        94.02950    Hannan-Quinn criter.    -4.092591
F-statistic           2040.559    Durbin-Watson stat       0.382851
Prob(F-statistic)     0.000000

Lag Distribution of LNY

i              Coefficient   Std. Error   t-Statistic
0                  0.27494      0.04377       6.28161
1                  0.41544      0.06162       6.74167
2                  0.42152      0.05358       7.86768
3                  0.29317      0.01976       14.8332
4                  0.03039      0.04056       0.74937
5                 -0.36682      0.12630      -2.90445
Sum of Lags        1.06865      0.01976       54.0919

Next we impose an Almon lag based on a second degree polynomial as described in equation
(6.4). Table 6.2 reports the EViews output for s = 5 imposing the near end point constraint.
To do this using EViews, one replaces the regressor Y by PDL(Y, 5, 2, 1) indicating a request
to fit a five year Almon lag on Y that is of the second-order degree, with a near end point
constraint. In this case, the estimated regression coefficients rise and then fall, becoming negative:
$\hat{\beta}_0 = 0.275$, $\hat{\beta}_1 = 0.415, \dots, \hat{\beta}_5 = -0.367$. Note that $\hat{\beta}_4$ is statistically insignificant. The Almon
lag restrictions can be jointly tested using Chow's F-statistic. The URSS is obtained from the
unrestricted regression of $C_t$ on $Y_t, Y_{t-1}, \dots, Y_{t-5}$ and a constant. This was reported above as
URSS = 0.016924.


Table 6.3 Almon Polynomial, r = 2, s = 5 and Far End-Point Constraint

Dependent Variable = LNC
Method: Least Squares
Sample (adjusted): 1964 2007
Included observations: 44 after adjustments

Variable     Coefficient   Std. Error   t-Statistic    Prob.
C              -1.107052     0.158405     -6.988756   0.0000
PDL01           0.134206     0.011488      11.68247   0.0000
PDL02          -0.259490     0.036378     -7.133152   0.0000

R-squared             0.994467    Mean dependent var       9.736786
Adjusted R-squared    0.994197    S.D. dependent var       0.289615
S.E. of regression    0.022062    Akaike info criterion   -4.724198
Sum squared resid     0.019956    Schwarz criterion       -4.602548
Log likelihood        106.9324    Hannan-Quinn criter.    -4.679084
F-statistic           3684.610    Durbin-Watson stat       0.337777
Prob(F-statistic)     0.000000

Lag Distribution of LNY

i              Coefficient   Std. Error   t-Statistic
0                  0.87912      0.10074       8.72642
1                  0.45018      0.03504       12.8475
2                  0.13421      0.01149       11.6825
3                 -0.06880      0.03787      -1.81686
4                 -0.15884      0.04483      -3.54337
5                 -0.13590      0.03221      -4.21962
Sum of Lags        1.09997      0.01547       71.0964

The RRSS, given in Table 6.2, is 0.035874 and involves four restrictions. Therefore,

$$F = \frac{(0.03587367 - 0.016924337)/4}{0.016924337/37} = 10.357$$

and this is distributed as $F_{4,37}$ under the null hypothesis. This rejects the second degree
polynomial Almon lag specification with a near end point constraint.

Table 6.3 reports the EViews output for s = 5, imposing the far end point constraint. To do
this using EViews, one replaces the regressor Y by PDL(Y, 5, 2, 2) indicating a request to fit a five
year Almon lag on Y that is of the second-order degree, with a far end point constraint. In this
case, the $\hat{\beta}$'s are positive, then become negative, $\hat{\beta}_0 = 0.879$, $\hat{\beta}_1 = 0.450, \dots, \hat{\beta}_5 = -0.136$, all
being statistically significant. This second degree polynomial Almon lag specification with a far
end point constraint can be tested against the unrestricted lag model using Chow's F-statistic.
The RRSS, given in Table 6.3, is 0.019955101 and involves four restrictions. Therefore,

$$F = \frac{(0.019955101 - 0.016924337)/4}{0.016924337/37} = 1.656$$

and this is distributed as $F_{4,37}$ under the null hypothesis. This does not reject the restrictions
imposed by this model.
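
The three F-statistics in this example follow from the same formula, so they can be checked with a short calculation. The function name `restriction_f_stat` is hypothetical; the residual sums of squares, the numbers of restrictions and the 37 degrees of freedom are the values reported in the text.

```python
def restriction_f_stat(rrss, urss, num_restrictions, df_unrestricted):
    """F = [(RRSS - URSS) / q] / [URSS / (n - k)] for q linear restrictions."""
    return ((rrss - urss) / num_restrictions) / (urss / df_unrestricted)

URSS = 0.016924337   # unrestricted regression on Y_t, ..., Y_{t-5} and a constant
DF = 37              # 44 observations minus 7 estimated coefficients

f_arithmetic = restriction_f_stat(0.036136249, URSS, 5, DF)   # Table 6.1
f_near = restriction_f_stat(0.03587367, URSS, 4, DF)          # Table 6.2
f_far = restriction_f_stat(0.019955101, URSS, 4, DF)          # Table 6.3

print(round(f_arithmetic, 2), round(f_near, 3), round(f_far, 3))
```

Only the far end point specification yields a small F-statistic, in line with the conclusion that its restrictions are not rejected.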


6.2  Infinite  Distributed  Lag 

So far we have been dealing with a finite number of lags imposed on $X_t$. Some lags may be
infinite. For example, the investment in building highways and roads several decades ago may
still have an effect on today's growth in GNP. In this case, we write equation (6.1) as

$$Y_t = \alpha + \sum_{i=0}^{\infty} \beta_i X_{t-i} + u_t \qquad t = 1, 2, \dots, T. \tag{6.7}$$

There are an infinite number of $\beta_i$'s to estimate with only T observations. This can only be
feasible if more structure is imposed on the $\beta_i$'s. First, we normalize these $\beta_i$'s by their sum,
i.e., let $w_i = \beta_i/\beta$ where $\beta = \sum_{i=0}^{\infty} \beta_i$. If all the $\beta_i$'s have the same sign, then the $\beta_i$'s take
the sign of $\beta$ and $0 \le w_i \le 1$ for all i, with $\sum_{i=0}^{\infty} w_i = 1$. This means that the $w_i$'s can be
interpreted as probabilities. In fact, Koyck (1954) imposed the geometric lag on the $w_i$'s, i.e.,
$w_i = (1 - \lambda)\lambda^i$ for $i = 0, 1, \dots, \infty$.¹ Substituting

$$\beta_i = \beta w_i = \beta(1 - \lambda)\lambda^i$$

in (6.7) we get

$$Y_t = \alpha + \beta(1 - \lambda) \sum_{i=0}^{\infty} \lambda^i X_{t-i} + u_t \tag{6.8}$$

Equation (6.8) is known as the infinite distributed lag form of the Koyck lag. The short-run
effect of a unit change in $X_t$ on $Y_t$ is given by $\beta(1 - \lambda)$; whereas the long-run effect of a unit
change in $X_t$ on $Y_t$ is $\sum_{i=0}^{\infty} \beta_i = \beta \sum_{i=0}^{\infty} w_i = \beta$. Implicit in the Koyck lag structure is that
the effect of a unit change in $X_t$ on $Y_t$ declines the further back we go in time. For example, if
$\lambda = 1/2$, then $\beta_0 = \beta/2$, $\beta_1 = \beta/4$, $\beta_2 = \beta/8$, etc. Defining $LX_t = X_{t-1}$, as the lag operator,
we have $L^i X_t = X_{t-i}$, and (6.8) reduces to

$$Y_t = \alpha + \beta(1 - \lambda) \sum_{i=0}^{\infty} \lambda^i L^i X_t + u_t = \alpha + \beta(1 - \lambda)X_t/(1 - \lambda L) + u_t \tag{6.9}$$
where we have used the fact that $\sum_{i=0}^{\infty} c^i = 1/(1 - c)$. Multiplying the last equation by $(1 - \lambda L)$
one gets

$$Y_t - \lambda Y_{t-1} = \alpha(1 - \lambda) + \beta(1 - \lambda)X_t + u_t - \lambda u_{t-1}$$

or

$$Y_t = \lambda Y_{t-1} + \alpha(1 - \lambda) + \beta(1 - \lambda)X_t + u_t - \lambda u_{t-1} \tag{6.10}$$

This is the autoregressive form of the infinite distributed lag. It is autoregressive because it
includes the lagged value of $Y_t$ as an explanatory variable. Note that we have reduced the
problem of estimating an infinite number of $\beta_i$'s into estimating $\lambda$ and $\beta$ from (6.10). However,
OLS would lead to biased and inconsistent estimates, because (6.10) contains a lagged dependent
variable as well as serially correlated errors. In fact the error in (6.10) is a Moving Average
process of order one, i.e., MA(1), see Chapter 14. We digress at this stage to give two econometric
models which would lead to equations resembling (6.10).
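
The equivalence between the distributed lag form (6.8) and the autoregressive form (6.10) can be checked numerically. This is a sketch under stated assumptions: the parameter values are hypothetical, $X_t$ and $u_t$ are taken to be zero before the sample starts (so the infinite sum is finite and $Y_{-1} = \alpha$), and `koyck_weights` is an illustrative helper name.

```python
import random

def koyck_weights(lam, n):
    """First n geometric weights w_i = (1 - lam) * lam**i."""
    return [(1 - lam) * lam ** i for i in range(n)]

alpha, beta, lam = 1.0, 2.0, 0.5          # hypothetical parameter values
random.seed(0)
T = 30
x = [random.gauss(0.0, 1.0) for _ in range(T)]
u = [random.gauss(0.0, 0.1) for _ in range(T)]

# Distributed lag form (6.8), with X_t = 0 before the sample
y_dl = [alpha + beta * (1 - lam) * sum(lam ** i * x[t - i] for i in range(t + 1)) + u[t]
        for t in range(T)]

# Autoregressive form (6.10), started from Y_{-1} = alpha and u_{-1} = 0
y_ar = []
y_prev, u_prev = alpha, 0.0
for t in range(T):
    y_t = lam * y_prev + alpha * (1 - lam) + beta * (1 - lam) * x[t] + u[t] - lam * u_prev
    y_ar.append(y_t)
    y_prev, u_prev = y_t, u[t]

max_gap = max(abs(a - b) for a, b in zip(y_dl, y_ar))

short_run = beta * (1 - lam)                       # effect of X_t on Y_t
long_run = beta * sum(koyck_weights(lam, 200))     # approaches beta as more weights are summed
```

Both recursions generate the same series up to rounding error, while the error entering the autoregressive form, $u_t - \lambda u_{t-1}$, is the MA(1) term noted in the text.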

6.2.1  Adaptive  Expectations  Model  (AEM) 

Suppose that output $Y_t$ is a function of expected sales $X_t^*$ and that the latter is unobservable,
i.e.,

$$Y_t = \alpha + \beta X_t^* + u_t$$

where expected sales are updated according to the following method

$$X_t^* - X_{t-1}^* = \delta(X_t - X_{t-1}^*) \tag{6.11}$$

that is, expected sales at time t is a weighted combination of expected sales at time t - 1 and
actual sales at time t. In fact,

$$X_t^* = \delta X_t + (1 - \delta)X_{t-1}^* \tag{6.12}$$

Equation (6.11) is also an error learning model, where one learns from past experience and adjusts
expectations after observing current sales. Using the lag operator L, (6.12) can be rewritten as
$X_t^* = \delta X_t/[1 - (1 - \delta)L]$. Substituting this last expression in the above relationship, we get

$$Y_t = \alpha + \beta\delta X_t/[1 - (1 - \delta)L] + u_t \tag{6.13}$$

Multiplying both sides of (6.13) by $[1 - (1 - \delta)L]$, we get

$$Y_t - (1 - \delta)Y_{t-1} = \alpha[1 - (1 - \delta)] + \beta\delta X_t + u_t - (1 - \delta)u_{t-1} \tag{6.14}$$

(6.14) looks exactly like (6.10) with $\lambda = (1 - \delta)$.
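
The lag-operator step can be spelled out by expanding the inverse of $[1 - (1 - \delta)L]$ as a geometric series. This expansion is a sketch that assumes $0 < \delta \le 1$, so that $|1 - \delta| < 1$ and the series converges; that condition is an added assumption, not one stated in the text.

```latex
\begin{align*}
X_t^* &= \frac{\delta X_t}{1 - (1-\delta)L}
       = \delta \sum_{i=0}^{\infty} (1-\delta)^i L^i X_t
       = \delta \sum_{i=0}^{\infty} (1-\delta)^i X_{t-i}, \\
Y_t   &= \alpha + \beta\delta \sum_{i=0}^{\infty} (1-\delta)^i X_{t-i} + u_t .
\end{align*}
```

Comparing the last line with (6.8), the weights $\beta\delta(1-\delta)^i$ are the Koyck weights $\beta(1-\lambda)\lambda^i$ with $\lambda = 1 - \delta$, so expected sales are a geometrically declining average of current and past actual sales.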

6.2.2  Partial  Adjustment  Model  (PAM) 

Under this model there is a cost of being out of equilibrium and a cost of adjusting to that
equilibrium, i.e.,

$$\text{Cost} = a(Y_t - Y_t^*)^2 + b(Y_t - Y_{t-1})^2 \tag{6.15}$$

where $Y_t^*$ is the target or equilibrium level for Y, whereas $Y_t$ is the current level of Y. The
first term of (6.15) gives a quadratic loss function proportional to the distance of $Y_t$ from the
equilibrium level $Y_t^*$. The second quadratic term represents the cost of adjustment. Minimizing
this quadratic cost function with respect to $Y_t$, we get $Y_t = \gamma Y_t^* + (1 - \gamma)Y_{t-1}$, where $\gamma = a/(a + b)$.
Note that if the cost of adjustment was zero, then b = 0, $\gamma = 1$, and the target is reached
immediately. However, there are costs of adjustment, especially in building the desired capital
stock. Hence,

$$Y_t = \gamma Y_t^* + (1 - \gamma)Y_{t-1} + u_t \tag{6.16}$$

where we made this relationship stochastic. If the true relationship is $Y_t^* = \alpha + \beta X_t$, then from
(6.16)

$$Y_t = \gamma\alpha + \gamma\beta X_t + (1 - \gamma)Y_{t-1} + u_t \tag{6.17}$$

and this looks like (6.10) with $\lambda = (1 - \gamma)$, except for the error term, which is not necessarily
MA(1) with the Moving Average parameter $\lambda$.
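
The minimization behind the partial adjustment rule is a one-line first-order condition. The derivation below is a sketch that assumes $a > 0$ and $b \ge 0$, so that the cost function is strictly convex in $Y_t$.

```latex
\begin{align*}
\frac{\partial\,\text{Cost}}{\partial Y_t}
  &= 2a(Y_t - Y_t^*) + 2b(Y_t - Y_{t-1}) = 0 \\
(a + b)\,Y_t &= a\,Y_t^* + b\,Y_{t-1} \\
Y_t &= \frac{a}{a+b}\,Y_t^* + \frac{b}{a+b}\,Y_{t-1}
     = \gamma Y_t^* + (1-\gamma)Y_{t-1}, \qquad \gamma = \frac{a}{a+b}.
\end{align*}
```

The second derivative is $2(a + b) > 0$, so this solution is a minimum. A larger adjustment cost b lowers $\gamma$ and slows the movement of $Y_t$ towards its target.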


6.3  Estimation  and  Testing  of  Dynamic  Models  with 
Serial  Correlation 

Both  the  AEM  and  the  PAM  give  equations  resembling  the  autoregressive  form  of  the  infinite 
distributed  lag.  In  all  cases,  we  end  up  with  a  lagged  dependent  variable  and  an  error  term 
that  is  either  Moving  Average  of  order  one  as  in  (6.10),  or  just  classical  or  autoregressive  as  in 
(6.17).  In  this  section  we  study  the  testing  and  estimation  of  such  autoregressive  or  dynamic 
models. 

If there is a $Y_{t-1}$ in the regression equation and the $u_t$'s are classical disturbances, as may
be the case in equation (6.17), then $Y_{t-1}$ is said to be contemporaneously uncorrelated with
the disturbance term $u_t$. In fact, the disturbances satisfy assumptions 1-4 of Chapter 3 and
$E(Y_{t-1}u_t) = 0$ even though $E(Y_{t-1}u_{t-1}) \neq 0$. In other words, $Y_{t-1}$ is not correlated with the
current disturbance $u_t$ but it is correlated with the lagged disturbance $u_{t-1}$. In this case, as
long as the disturbances are not serially correlated, OLS will be biased, but remains consistent
and asymptotically efficient. This case is unlikely with economic data given that most macro
time-series variables are highly trended. More likely, the $u_t$'s are serially correlated. In this case,
OLS is biased and inconsistent. Intuitively, $Y_t$ is related to $u_t$, so $Y_{t-1}$ is related to $u_{t-1}$. If $u_t$
and $u_{t-1}$ are correlated, then $Y_{t-1}$ and $u_t$ are correlated. This means that one of the regressors,
lagged Y, is correlated with $u_t$ and we have the problem of endogeneity. Let us demonstrate
what happens to OLS for the simple autoregressive model with no constant

$$Y_t = \beta Y_{t-1} + \nu_t \qquad |\beta| < 1 \qquad t = 1, 2, \dots, T \tag{6.18}$$

with $\nu_t = \rho\nu_{t-1} + \epsilon_t$, $|\rho| < 1$ and $\epsilon_t \sim IIN(0, \sigma_\epsilon^2)$. One can show, see problem 3, that

$$\hat{\beta}_{OLS} = \sum_{t=2}^{T} Y_t Y_{t-1} \Big/ \sum_{t=2}^{T} Y_{t-1}^2 = \beta + \sum_{t=2}^{T} Y_{t-1}\nu_t \Big/ \sum_{t=2}^{T} Y_{t-1}^2$$

with $plim(\hat{\beta}_{OLS} - \beta) = \text{asymp. bias}(\hat{\beta}_{OLS}) = \rho(1 - \beta^2)/(1 + \rho\beta)$. This asymptotic bias is positive
if $\rho > 0$ and negative if $\rho < 0$. Also, this asymptotic bias can be large for small values of $\beta$ and
large values of $\rho$. For example, if $\rho = 0.9$ and $\beta = 0.2$, the asymptotic bias for $\beta$ is 0.73. This is
more than 3 times the value of $\beta$.
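
The size of this bias can be illustrated with a short calculation and a rough simulation check. The function names, the sample size and the seed below are assumptions chosen for illustration; the simulation draws one long sample from (6.18) with AR(1) errors and compares the OLS slope with $\beta$ plus the asymptotic bias.

```python
import random

def asymptotic_bias_ols(rho, beta):
    """Asymptotic bias of OLS in (6.18): rho * (1 - beta**2) / (1 + rho * beta)."""
    return rho * (1.0 - beta ** 2) / (1.0 + rho * beta)

def simulate_beta_ols(rho, beta, T, seed=0):
    """OLS slope from regressing Y_t on Y_{t-1} (no constant) in one simulated sample."""
    rng = random.Random(seed)
    y_prev, v_prev = 0.0, 0.0
    num = den = 0.0
    for _ in range(T):
        v = rho * v_prev + rng.gauss(0.0, 1.0)   # AR(1) disturbance
        y = beta * y_prev + v                    # lagged dependent variable model
        num += y * y_prev
        den += y_prev ** 2
        y_prev, v_prev = y, v
    return num / den

rho, beta = 0.9, 0.2
bias = asymptotic_bias_ols(rho, beta)
beta_sim = simulate_beta_ols(rho, beta, T=100_000)
print(round(bias, 2), round(beta + bias, 3), round(beta_sim, 3))
```

In a long sample the simulated OLS estimate settles near $\beta$ plus the bias rather than near $\beta$ itself, which is what inconsistency means here.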

Also,  p  =  J2j=2"tvt-i/  where  Vt  =  Yt-  (3OLSYt-i  has 

plim(p  -  p)  =  -p{  1  -  /32)/(l  +  p(3)  =  -  asymp.bias 0OLS) 

This  means  that  if  p  >  0,  then  p  would  be  negatively  biased.  However,  if  p  <  o,  then  p  is 
positively  biased.  In  both  cases,  p  is  biased  towards  zero.  In  fact,  the  asymptotic  bias  of  the 
D.W.  statistic  is  twice  the  asymptotic  bias  of  Pols >  see  problem  3.  This  means  that  the  D.W. 
statistic  is  biased  towards  not  rejecting  the  null  hypothesis  of  zero  serial  correlation.  Therefore, 
if  the  D.W.  statistic  rejects  the  null  of  p  =  0,  it  is  doing  that  when  the  odds  are  against  it, 
and  therefore  confirming  our  rejection  of  the  null  and  the  presence  of  serial  correlation.  If  on 
the  other  hand  it  does  not  reject  the  null,  then  the  D.W.  statistic  is  uninformative  and  has 
to  be  replaced  by  another  conclusive  test  for  serial  correlation.  Such  an  alternative  test  in  the 
presence  of  a  lagged  dependent  variable  has  been  developed  by  Durbin  (1970),  and  the  statistic 
computed  is  called  Durbin’s  h.  Using  (6.10)  or  (6.17),  one  computes  OLS  ignoring  its  possible 
bias  and  p  from  OLS  residuals  as  shown  above.  Durbin’s  h  is  given  by 

h  =  p[n/(l  —  n  vaf(coeff.  of  Yt- 1))]1'/2.  (6.19) 

This  is  asymptotically  distributed  N(0,1)  under  null  hypothesis  of  p  =  0.  If  n[vaf(coeff.  of 
Yt-\)\  is  greater  than  one,  then  h  cannot  be  computed,  and  Durbin  suggests  running  the  OLS 
residuals  et  on  et-i  and  the  regressors  in  the  model  (including  the  lagged  dependent  variable), 
and  testing  whether  the  coefficient  of  et- 1  in  this  regression  is  significant.  In  fact,  this  test  can 
be  generalized  to  higher  order  autoregressive  errors.  Let  ut  follow  an  AR(p)  process 

ut  =  PiUt-i  +  p2ut- 2  +  •■  +  Pput-p  +  et 

then  this  test  involves  running  et  on  et-i,  et- 2, . . . ,  et-v  and  the  regressors  in  the  model  including 
Yt- 1.  The  test  statistic  for  Hq]  Pi  =  p2  =  ••  =  Pp  =  0;  is  TR2  which  is  distributed  x2.  This  is  the 
Lagrange  multiplier  test  developed  independently  by  Breusch  (1978)  and  Godfrey  (1978)  and 
discussed  in  Chapter  5.  In  fact,  this  test  has  other  useful  properties.  For  example,  this  test  is  the 
same  whether  the  null  imposes  an  AR(p)  model  or  an  MA(p)  model  on  the  disturbances,  see 
Chapter  14.  Kiviet  (1986)  argues  that  even  though  these  are  large  sample  tests,  the  Breusch- 
Godfrey  test  is  preferable  to  Durbin’s  h  in  small  samples. 
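
The two tests above can be sketched numerically. The following is only an illustration under stated assumptions, not code from the text: it uses NumPy and simulated data, the helper names `ols`, `durbin_h`, `breusch_godfrey` and `simulate` are hypothetical, $\rho$ is estimated by the first-order autocorrelation of the OLS residuals, and pre-sample lagged residuals are set to zero (one common convention).

```python
import numpy as np

def ols(y, X):
    """OLS coefficients, residuals and estimated coefficient covariance."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    e = y - X @ b
    return b, e, (e @ e / (n - k)) * xtx_inv

def durbin_h(e, var_lag_coef):
    """Durbin's h as in (6.19); returns None when n * var >= 1."""
    n = len(e)
    rho_hat = (e[1:] @ e[:-1]) / (e @ e)
    if n * var_lag_coef >= 1:
        return None
    return rho_hat * np.sqrt(n / (1 - n * var_lag_coef))

def breusch_godfrey(e, X, p=1):
    """n * R^2 from regressing e_t on X and p lags of e_t."""
    lags = [np.concatenate([np.zeros(j), e[:-j]]) for j in range(1, p + 1)]
    W = np.column_stack([X] + lags)
    _, v, _ = ols(e, W)
    # e has mean zero when X contains a constant, so e'e is the total sum of squares.
    return len(e) * (1 - (v @ v) / (e @ e))

def simulate(n=500, beta=0.5, rho=0.6, seed=0):
    """Y_t = 1 + beta*Y_{t-1} + 0.8*X_t + u_t with AR(1) u_t (illustrative values)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n + 1)
    eps = rng.normal(size=n + 1)
    u = np.zeros(n + 1)
    y = np.zeros(n + 1)
    for t in range(1, n + 1):
        u[t] = rho * u[t - 1] + eps[t]
        y[t] = 1 + beta * y[t - 1] + 0.8 * x[t] + u[t]
    X = np.column_stack([np.ones(n), y[:-1], x[1:]])
    return y[1:], X

y, X = simulate()
b, e, cov = ols(y, X)
h = durbin_h(e, cov[1, 1])        # column 1 of X holds the lagged dependent variable
bg = breusch_godfrey(e, X, p=1)   # compare with a chi-squared(1) critical value
print("Durbin h:", h, "Breusch-Godfrey n*R^2:", bg)
```

With serially correlated disturbances in the simulated data, both statistics should fall well into their rejection regions; setting `rho=0.0` in `simulate` gives a sample generated under the null.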

6.3.1  A  Lagged  Dependent  Variable  Model  with  AR(1)  Disturbances 

A  model  with  a  lagged  dependent  variable  and  an  autoregressive  error  term  is  estimated  using 
instrumental  variables  (IV).  This  method  will  be  studied  extensively  in  Chapter  11.  In  short,  the 
IV method corrects for the correlation between $Y_{t-1}$ and the error term by replacing $Y_{t-1}$ with
its predicted value $\hat{Y}_{t-1}$. The latter is obtained by regressing $Y_{t-1}$ on some exogenous variables,
say a set of Z's, which are called a set of instruments for $Y_{t-1}$. Since these variables are exogenous
and uncorrelated with $u_t$, $\hat{Y}_{t-1}$ will not be correlated with $u_t$. Suppose the regression equation
is

$$Y_t = \alpha + \beta Y_{t-1} + \gamma X_t + u_t \qquad t = 2, \ldots, T \tag{6.20}$$

and that at least one exogenous variable $Z_t$ exists which will be our instrument for $Y_{t-1}$.
Regressing $Y_{t-1}$ on $X_t$, $Z_t$ and a constant, we get

$$Y_{t-1} = \hat{Y}_{t-1} + \hat{\nu}_t = \hat{a}_1 + \hat{a}_2 Z_t + \hat{a}_3 X_t + \hat{\nu}_t \tag{6.21}$$

Then $\hat{Y}_{t-1} = \hat{a}_1 + \hat{a}_2 Z_t + \hat{a}_3 X_t$ and is independent of $u_t$, because it is a linear combination of
exogenous variables. But, $Y_{t-1}$ is correlated with $u_t$. This means that $\hat{\nu}_t$ is the part of $Y_{t-1}$ that
is correlated with $u_t$. Substituting $Y_{t-1} = \hat{Y}_{t-1} + \hat{\nu}_t$ in (6.20) we get

$$Y_t = \alpha + \beta \hat{Y}_{t-1} + \gamma X_t + (u_t + \beta\hat{\nu}_t) \tag{6.22}$$

$\hat{Y}_{t-1}$ is uncorrelated with the new error term $(u_t + \beta\hat{\nu}_t)$ because $\sum \hat{Y}_{t-1}\hat{\nu}_t = 0$ from (6.21). Also,
$X_t$ is uncorrelated with $u_t$ by assumption. But, from (6.21), $X_t$ also satisfies $\sum X_t\hat{\nu}_t = 0$. Hence,
$X_t$ is uncorrelated with the new error term $(u_t + \beta\hat{\nu}_t)$. This means that OLS applied to (6.22)
will lead to consistent estimates of $\alpha$, $\beta$ and $\gamma$. The only remaining question is where do we find
instruments like $Z_t$? This $Z_t$ should be (i) uncorrelated with $u_t$, (ii) preferably predicting $Y_{t-1}$
fairly well, but, not predicting it perfectly, otherwise $\hat{Y}_{t-1} = Y_{t-1}$. If this happens, we are back to
OLS which we know is inconsistent, (iii) $\sum z_t^2/T$ should be finite and different from zero. Recall
that $z_t = Z_t - \bar{Z}$. In this case, $X_{t-1}$ seems like a natural instrumental variable candidate. It is
an exogenous variable which would very likely predict $Y_{t-1}$ fairly well, and satisfies $\sum x_{t-1}^2/T$
being finite and different from zero. In other words, (6.21) regresses $Y_{t-1}$ on a constant, $X_{t-1}$
and $X_t$, and gets $\hat{Y}_{t-1}$. Additional lags on $X_t$ can be used as instruments to improve the small
sample properties of this estimator. Substituting $\hat{Y}_{t-1}$ in equation (6.22) results in consistent
estimates of the regression parameters. Wallis (1967) substituted these consistent estimates in
the original equation (6.20) and obtained the residuals $\tilde{u}_t$. Then he computed

$$\hat{\rho} = \frac{\sum_{t=2}^{T} \tilde{u}_t\tilde{u}_{t-1}/(T-1)}{\sum_{t=1}^{T} \tilde{u}_t^2/T} + \frac{3}{T}$$

where the last term corrects for the bias in $\hat{\rho}$. At this stage, one can perform a Prais-Winsten
procedure on (6.20) using $\hat{\rho}$ instead of $\rho$, see Fomby and Guilkey (1983).

An alternative two-step procedure has been proposed by Hatanaka (1974). After estimating
(6.22) and obtaining the residuals $\tilde{u}_t$ from (6.20), Hatanaka (1974) suggests running
$Y_t^* = Y_t - \tilde{\rho}Y_{t-1}$ on $Y_{t-1}^* = Y_{t-1} - \tilde{\rho}Y_{t-2}$, $X_t^* = X_t - \tilde{\rho}X_{t-1}$ and $\tilde{u}_{t-1}$. Note that this is the Cochrane-Orcutt
transformation which ignores the first observation. Also, $\tilde{\rho} = \sum_{t=3}^{T}\tilde{u}_t\tilde{u}_{t-1}/\sum_{t=3}^{T}\tilde{u}_t^2$ ignores the
small sample bias correction factor suggested by Wallis (1967). Let $\tilde{\delta}$ be the coefficient of $\tilde{u}_{t-1}$,
then the efficient estimator of $\rho$ is given by $\tilde{\rho} + \tilde{\delta}$. Hatanaka shows that the resulting
estimators are asymptotically equivalent to the MLE in the presence of Normality.
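
The IV step, the Wallis estimate of $\rho$ and Hatanaka's second step can be traced in a short sketch. This is an illustration under assumptions rather than the procedure as implemented in any package: the data are simulated, $X_{t-1}$ is the only instrument, and the names `lstsq`, `hatanaka_two_step` and `simulate` are hypothetical.

```python
import numpy as np

def lstsq(y, X):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def hatanaka_two_step(Y, X):
    """IV, Wallis and Hatanaka steps for Y_t = a + b*Y_{t-1} + g*X_t + u_t."""
    y, ylag, x, xlag = Y[1:], Y[:-1], X[1:], X[:-1]
    one = np.ones_like(y)
    # First stage (6.21): Y_{t-1} on a constant, the instrument X_{t-1} and X_t.
    Z = np.column_stack([one, xlag, x])
    ylag_hat = Z @ lstsq(ylag, Z)
    # Second stage (6.22): replace Y_{t-1} by its fitted value.
    iv = lstsq(y, np.column_stack([one, ylag_hat, x]))
    # Residuals from the original equation (6.20) at the consistent estimates.
    u = y - np.column_stack([one, ylag, x]) @ iv
    T = len(u)
    rho_wallis = ((u[1:] @ u[:-1]) / (T - 1)) / ((u @ u) / T) + 3 / T
    rho_tilde = (u[1:] @ u[:-1]) / (u[1:] @ u[1:])
    # Hatanaka regression: quasi-differenced data plus the lagged residual.
    W = np.column_stack([
        np.ones(T - 1),
        ylag[1:] - rho_tilde * ylag[:-1],
        x[1:] - rho_tilde * x[:-1],
        u[:-1],
    ])
    step2 = lstsq(y[1:] - rho_tilde * y[:-1], W)
    return {"iv": iv, "rho_wallis": rho_wallis, "rho_tilde": rho_tilde,
            "hatanaka": step2, "rho_efficient": rho_tilde + step2[3]}

def simulate(T=5000, a=1.0, b=0.5, g=1.0, rho=0.6, seed=0):
    """Simulated lagged dependent variable model with AR(1) disturbances."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=T)
    eps = rng.normal(size=T)
    Y = np.zeros(T)
    u = np.zeros(T)
    for t in range(1, T):
        u[t] = rho * u[t - 1] + eps[t]
        Y[t] = a + b * Y[t - 1] + g * X[t] + u[t]
    return Y, X

res = hatanaka_two_step(*simulate())
print(res)
```

Note that the intercept in the Hatanaka regression estimates $\alpha(1 - \rho)$ rather than $\alpha$, because the constant is quasi-differenced along with the other variables.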

Empirical Example: Consider the Consumption-Income data from the Economic Report of the
President over the period 1959-2007 given in Table 5.3. Problem 5 asks the reader to verify
that Durbin's h obtained from the lagged dependent variable model described in (6.20) yields
a value of 3.50. This is asymptotically distributed as N(0,1) under the null hypothesis of no
serial correlation of the disturbances. This null is soundly rejected. The Breusch and Godfrey
test runs the regression of OLS residuals on their lagged values and the regressors in the model.
This yields a $TR^2 = 13.37$. This is distributed as $\chi^2_1$ under the null and has a p-value of 0.0003.
Therefore, we reject the hypothesis of no first-order serial correlation. Next, we estimate (6.20)
using current and lagged values of income ($Y_t$, $Y_{t-1}$ and $Y_{t-2}$) as a set of instruments for lagged
consumption ($C_{t-1}$). The regression given by (6.22), yields:

$$C_t = \underset{(0.280)}{-0.831} + \underset{(0.303)}{0.104}\,\hat{C}_{t-1} + \underset{(0.326)}{1.177}\,Y_t + \text{residuals}$$

Substituting these estimates in (6.20), one gets the residuals $\tilde{u}_t$. Based on these $\tilde{u}_t$'s, the Wallis
(1967) estimate of $\rho$ yields $\hat{\rho} = 0.907$ and Hatanaka's (1974) estimate of $\rho$ yields $\tilde{\rho} = 0.828$.
Running the Hatanaka regression gives

$$C_t^* = \underset{(0.053)}{-0.142} + \underset{(0.083)}{0.233}\,C_{t-1}^* + \underset{(0.095)}{0.843}\,Y_t^* + \underset{(0.058)}{0.017}\,\tilde{u}_{t-1} + \text{residuals}$$

where $C_t^* = C_t - \tilde{\rho}C_{t-1}$. The efficient estimate of $\rho$ is given by $\tilde{\rho} + 0.017 = 0.846$.

6.3.2  A  Lagged  Dependent  Variable  Model  with  MA(1)  Disturbances 

Zellner  and  Geisel  (1970)  estimated  the  Koyck  autoregressive  representation  of  the  infinite 
distributed  lag,  given  in  (6.10).  In  fact,  we  saw  that  this  could  also  arise  from  the  AEM,  see 
(6.14).  In  particular,  it  is  a  regression  with  a  lagged  dependent  variable  and  an  MA(1)  error 
term with the added restriction that the coefficient of $Y_{t-1}$ is the same as the MA(1) parameter.
For simplicity, we write

$$Y_t = \alpha + \lambda Y_{t-1} + \beta X_t + (u_t - \lambda u_{t-1}) \tag{6.23}$$

Let $w_t = Y_t - u_t$, then (6.23) becomes

$$w_t = \alpha + \lambda w_{t-1} + \beta X_t \tag{6.24}$$

By continuous substitution of lagged values of $w_t$ in (6.24) we get

$$w_t = \alpha(1 + \lambda + \lambda^2 + \cdots + \lambda^{t-1}) + \lambda^t w_0 + \beta(X_t + \lambda X_{t-1} + \cdots + \lambda^{t-1}X_1)$$

and replacing $w_t$ by $(Y_t - u_t)$, we get

$$Y_t = \alpha(1 + \lambda + \lambda^2 + \cdots + \lambda^{t-1}) + \lambda^t w_0 + \beta(X_t + \lambda X_{t-1} + \cdots + \lambda^{t-1}X_1) + u_t \tag{6.25}$$

Knowing $\lambda$, this equation can be estimated via OLS assuming that the disturbances $u_t$ are not
serially correlated. Since $\lambda$ is not known, Zellner and Geisel (1970) suggest a search procedure
over $\lambda$, where $0 < \lambda < 1$. The regression with the minimum residual sums of squares gives the
optimal $\lambda$, and the corresponding regression gives the estimates of $\alpha$, $\beta$ and $w_0$. The last
coefficient $w_0 = Y_0 - u_0 = E(Y_0)$ can be interpreted as the expected value of the initial observation on
the dependent variable. Klein (1958) considered the direct estimation of the infinite Koyck lag,
given in (6.8) and arrived at (6.25). The search over $\lambda$ results in MLEs of the coefficients. Note,
however, that the estimate of $w_0$ is not consistent. Intuitively, as t tends to infinity, $\lambda^t$ tends to
zero implying no new information to estimate $w_0$. In fact, some applied researchers ignore the
variable $\lambda^t$ in the regression given in (6.25). This practice, known as truncating the remainder, is
not recommended since the Monte Carlo experiments of Maddala and Rao (1971) and Schmidt
(1975) have shown that even for T = 60 or 100, it is not desirable to omit $\lambda^t$ from (6.25).
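
The search over $\lambda$ can be sketched as a simple grid search. The code below is a minimal illustration on simulated data, assuming NumPy; the function names `zellner_geisel` and `simulate`, the grid spacing and the parameter values are all hypothetical choices made for the example.

```python
import numpy as np

def zellner_geisel(Y, X, grid=None):
    """Grid search over lambda for (6.25); Y[0] and X[0] are the t = 1 observations."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    T = len(Y)
    t = np.arange(1, T + 1)
    best = None
    for lam in grid:
        const = np.cumsum(lam ** (t - 1))   # 1 + lam + ... + lam^(t-1)
        init = lam ** t                     # regressor whose coefficient is w_0
        dlag = np.empty(T)                  # X_t + lam*X_{t-1} + ... + lam^(t-1)*X_1
        acc = 0.0
        for i in range(T):
            acc = lam * acc + X[i]
            dlag[i] = acc
        W = np.column_stack([const, init, dlag])
        coef = np.linalg.lstsq(W, Y, rcond=None)[0]
        rss = float(np.sum((Y - W @ coef) ** 2))
        if best is None or rss < best["rss"]:
            best = {"rss": rss, "lam": float(lam), "alpha": float(coef[0]),
                    "w0": float(coef[1]), "beta": float(coef[2])}
    return best

def simulate(T=300, alpha=1.0, lam=0.6, beta=1.0, w0=2.0, sigma=0.5, seed=0):
    """w_t = alpha + lam*w_{t-1} + beta*X_t and Y_t = w_t + u_t, as in (6.24)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=T)
    w = np.empty(T)
    prev = w0
    for i in range(T):
        prev = alpha + lam * prev + beta * X[i]
        w[i] = prev
    return w + sigma * rng.normal(size=T), X

fit = zellner_geisel(*simulate())
print(fit)
```

The estimate of `w0` is reported only for completeness; as noted above, it is not consistent, because the regressor $\lambda^t$ dies out as t grows.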

In summary, we have learned how to estimate a dynamic model with a lagged dependent
variable and serially correlated errors. In case the error is autoregressive of order one, we have
outlined the steps to implement the Wallis Two-Stage estimator and Hatanaka's two-step
procedure. In case the error is Moving Average of order one, we have outlined the steps to implement
the Zellner-Geisel procedure.

So far, section 6.1 considered finite distributed lags on the explanatory variables, whereas section
6.2 considered an autoregressive relation including the first lag of the dependent variable and
current values of the explanatory variables. In general, economic relationships may be generated
by an Autoregressive Distributed Lag (ADL) scheme. The simplest form is the ADL (1,1) model
which is given by

$$Y_t = \alpha + \lambda Y_{t-1} + \beta_0 X_t + \beta_1 X_{t-1} + u_t \tag{6.26}$$


where both $Y_t$ and $X_t$ are lagged once. By specifying higher order lags for $Y_t$ and $X_t$, say an
ADL (p,q) with p lags on $Y_t$ and q lags on $X_t$, one can test whether the specification now is
general  enough  to  ensure  White  noise  disturbances.  Next,  one  can  test  whether  some  restrictions 
can  be  imposed  on  this  general  model,  like  reducing  the  order  of  the  lags  to  arrive  at  a  simpler 
ADL  model,  or  estimating  the  simpler  static  model  with  the  Cochrane-Orcutt  correction  for 
serial  correlation,  see  problem  20  in  Chapter  7.  This  general  to  specific  modelling  strategy  is 
prescribed  by  David  Hendry  and  is  utilized  by  the  econometric  software  PC-Give,  see  Gilbert 
(1986). 

Returning to the ADL (1,1) model in (6.26) one can invert the autoregressive form as follows:

$$Y_t = \alpha(1 + \lambda + \lambda^2 + \cdots) + (1 + \lambda L + \lambda^2 L^2 + \cdots)(\beta_0 X_t + \beta_1 X_{t-1} + u_t) \tag{6.27}$$

provided $|\lambda| < 1$. This equation gives the effect of a unit change in $X_t$ on future values of
$Y_t$. In fact, $\partial Y_t/\partial X_t = \beta_0$ while $\partial Y_{t+1}/\partial X_t = \beta_1 + \lambda\beta_0$, etc. This gives the immediate short-run
responses with the long-run effect being the sum of all these partial derivatives yielding
$(\beta_0 + \beta_1)/(1 - \lambda)$. This can be alternatively derived from (6.26) at the long-run static equilibrium
$(Y^*, X^*)$ where $Y_t = Y_{t-1} = Y^*$, $X_t = X_{t-1} = X^*$ and the disturbance is set equal to zero, i.e.,

$$Y^* = \frac{\alpha}{1 - \lambda} + \frac{\beta_0 + \beta_1}{1 - \lambda}X^* \tag{6.28}$$


Replacing $Y_t$ by $Y_{t-1} + \Delta Y_t$ and $X_t$ by $X_{t-1} + \Delta X_t$ in (6.26) one gets

$$\Delta Y_t = \alpha + \beta_0\Delta X_t - (1 - \lambda)Y_{t-1} + (\beta_0 + \beta_1)X_{t-1} + u_t$$

This can be rewritten as

$$\Delta Y_t = \beta_0\Delta X_t - (1 - \lambda)\left[Y_{t-1} - \frac{\alpha}{1 - \lambda} - \frac{\beta_0 + \beta_1}{1 - \lambda}X_{t-1}\right] + u_t \tag{6.29}$$

Note that the term in brackets contains the long-run equilibrium parameters derived in (6.28).
In fact, the term in brackets represents the deviation of $Y_{t-1}$ from the long-run equilibrium
term corresponding to $X_{t-1}$. Equation (6.29) is known as the Error Correction Model (ECM),
see Davidson, Hendry, Srba and Yeo (1978). $Y_t$ is obtained from $Y_{t-1}$ by adding the short-run
effect of the change in $X_t$ and a long-run equilibrium adjustment term. Since the disturbances
are White noise, this model is estimated by OLS.
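
Because the ECM is only a reparametrization of the ADL (1,1) model, OLS on either form implies the same short-run and long-run effects. The sketch below checks this on simulated data; it assumes NumPy, and the names `fit_adl_and_ecm` and `simulate` as well as the parameter values are hypothetical.

```python
import numpy as np

def fit_adl_and_ecm(Y, X):
    """OLS on the ADL(1,1) form (6.26) and on the unrestricted ECM form."""
    y, ylag, x, xlag = Y[1:], Y[:-1], X[1:], X[:-1]
    one = np.ones_like(y)
    # Levels form: Y_t on a constant, Y_{t-1}, X_t and X_{t-1}.
    adl = np.linalg.lstsq(np.column_stack([one, ylag, x, xlag]), y, rcond=None)[0]
    alpha, lam, b0, b1 = adl
    # ECM form: dY_t on a constant, dX_t, Y_{t-1} and X_{t-1}.
    ecm = np.linalg.lstsq(np.column_stack([one, x - xlag, ylag, xlag]),
                          y - ylag, rcond=None)[0]
    c, d0, phi, psi = ecm          # phi estimates -(1 - lam), psi estimates b0 + b1
    return {"adl": adl, "ecm": ecm,
            "long_run_adl": (b0 + b1) / (1 - lam),
            "long_run_ecm": -psi / phi}

def simulate(T=2000, alpha=1.0, lam=0.7, b0=0.5, b1=0.2, seed=0):
    """ADL(1,1) data with a persistent X (illustrative parameter values)."""
    rng = np.random.default_rng(seed)
    X = np.zeros(T)
    Y = np.zeros(T)
    for t in range(1, T):
        X[t] = 0.8 * X[t - 1] + rng.normal()
        Y[t] = alpha + lam * Y[t - 1] + b0 * X[t] + b1 * X[t - 1] + rng.normal()
    return Y, X

res = fit_adl_and_ecm(*simulate())
print(res["long_run_adl"], res["long_run_ecm"])
```

The coefficient on $\Delta X_t$ in the ECM reproduces the ADL estimate of $\beta_0$, and the coefficient on $Y_{t-1}$ equals the ADL estimate of $\lambda$ minus one.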
Note

1. Other distributions besides the geometric distribution can be considered. In fact, a Pascal
distribution was considered by Solow (1960), a rational-lag distribution was considered by Jorgenson
(1966), and a Gamma distribution was considered by Schmidt (1974, 1975). See Maddala (1977)
for an excellent review.


Problems 

1.  Consider  the  Consumption-Income  data  given  in  Table  5.3  and  provided  on  the  Springer  web  site 
as  CONSUMP.DAT.  Estimate  a  Consumption-Income  regression  in  logs  that  allows  for  a  six  year 
lag  on  income  as  follows: 

(a)  Use  the  linear  arithmetic  lag  given  in  equation  (6.2).  Show  that  this  result  can  also  be 
obtained  as  an  Almon  lag  first-degree  polynomial  with  a  far  end  point  constraint. 

(b)  Use  an  Almon  lag  second-degree  polynomial,  described  in  equation  (6.4),  imposing  the  near 
end  point  constraint. 

(c)  Use  an  Almon  lag  second-degree  polynomial  imposing  the  far  end  point  constraint. 

(d)  Use  an  Almon  lag  second-degree  polynomial  imposing  both  end  point  constraints. 

(e) Using Chow's F-statistic, test the arithmetic lag restrictions given in part (a).

(f) Using Chow's F-statistic, test the Almon lag restrictions implied by the model in part (b).

(g)  Repeat  part  (f)  for  the  restrictions  imposed  in  parts  (c)  and  (d). 

2. Consider fitting an Almon lag third degree polynomial $\beta_i = a_0 + a_1 i + a_2 i^2 + a_3 i^3$ for $i = 0, 1, \ldots, 5$,
on the Consumption-Income relationship in logarithms. In this case, there are five lags on income,
i.e., s = 5.

(a) Set up the estimating equation for the $a_i$'s and report the estimates using OLS.

(b) What is your estimate of $\beta_3$? What is the standard error? Can you relate the var($\hat{\beta}_3$) to the
variances and covariances of the $\hat{a}_i$'s?

(c) How would the OLS regression in part (a) change if we impose the near end point constraint
$\beta_{-1} = 0$?

(d) Test the near end point constraint.

(e) Test the Almon lag specification given in part (a) against an unrestricted five year lag
specification on income.

3. For the simple dynamic model with AR(1) disturbances given in (6.18),

(a) Verify that plim$(\hat{\beta}_{OLS} - \beta) = \rho(1 - \beta^2)/(1 + \rho\beta)$. Hint: From (6.18), $Y_{t-1} = \beta Y_{t-2} + \nu_{t-1}$ and
$\rho Y_{t-1} = \rho\beta Y_{t-2} + \rho\nu_{t-1}$. Subtracting this last equation from (6.18) and re-arranging terms,
one gets $Y_t = (\beta + \rho)Y_{t-1} - \rho\beta Y_{t-2} + \epsilon_t$. Multiply both sides by $Y_{t-1}$ and sum:
$\sum_{t=2}^{T} Y_tY_{t-1} = (\beta + \rho)\sum_{t=2}^{T} Y_{t-1}^2 - \rho\beta\sum_{t=2}^{T} Y_{t-1}Y_{t-2} + \sum_{t=2}^{T} Y_{t-1}\epsilon_t$. Now divide by $\sum_{t=2}^{T} Y_{t-1}^2$ and take
probability limits. See Griliches (1961).

(b) For various values of $|\rho| < 1$ and $|\beta| < 1$, tabulate the asymptotic bias computed in part (a).

(c) Verify that plim$(\hat{\rho} - \rho) = -\rho(1 - \beta^2)/(1 + \rho\beta) = -$plim$(\hat{\beta}_{OLS} - \beta)$.

(d) Using part (c), show that plim $d = 2(1 - \text{plim}\,\hat{\rho}) = 2\left[1 - \frac{\beta\rho(\beta + \rho)}{1 + \beta\rho}\right]$ where
$d = \sum_{t=2}^{T}(\hat{\nu}_t - \hat{\nu}_{t-1})^2/\sum_{t=1}^{T}\hat{\nu}_t^2$ denotes the Durbin-Watson statistic.

(e) Knowing the true disturbances, the Durbin-Watson statistic would be
$d^* = \sum_{t=2}^{T}(\nu_t - \nu_{t-1})^2/\sum_{t=1}^{T}\nu_t^2$ and its plim $d^* = 2(1 - \rho)$. Using part (d), show that
plim$(d - d^*) = \frac{2\rho(1 - \beta^2)}{1 + \beta\rho} = 2\,$plim$(\hat{\beta}_{OLS} - \beta)$ obtained in part (a). See Nerlove and Wallis (1966). For
various values of $|\rho| < 1$ and $|\beta| < 1$, tabulate $d^*$ and $d$ and the asymptotic bias in part (d).

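4. For the simple dynamic model given in (6.18), let the disturbances follow an MA(1) process
$\nu_t = \epsilon_t + \theta\epsilon_{t-1}$ with $\epsilon_t \sim$ IIN$(0, \sigma_\epsilon^2)$.

(a) Show that plim$(\hat{\beta}_{OLS} - \beta) = \delta(1 - \beta^2)/(1 + 2\beta\delta)$ where $\delta = \theta/(1 + \theta^2)$.

(b) Tabulate this asymptotic bias for various values of $|\beta| < 1$ and $0 < \theta < 1$.

(c) Show that plim$\left(\frac{1}{T}\sum_{t=2}^{T}\hat{\nu}_t^2\right) = \sigma_\epsilon^2[1 + \theta(\theta - \theta^*)]$ where $\theta^* = \delta(1 - \beta^2)/(1 + 2\beta\delta)$ and
$\hat{\nu}_t = Y_t - \hat{\beta}_{OLS}Y_{t-1}$.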

5.  Consider  the  lagged  dependent  variable  model  given  in  (6.20).  Using  the  Consumption-Income 
data  from  the  Economic  Report  of  the  President  over  the  period  1950-1993  which  is  given  in 
Table  5.3. 


(a)  Test  for  first-order  serial  correlation  in  the  disturbances  using  Durbin’s  h  given  in  (6.19). 

(b)  Test  for  first-order  serial  correlation  in  the  disturbances  using  the  Breusch  (1978)  and  Godfrey 
(1978)  test. 

(c)  Test  for  second-order  serial  correlation  in  the  disturbances. 


6.  Using  the  U.S.  gasoline  data  in  Chapter  4,  problem  15  given  in  Table  4.2  and  obtained  from  the 
USGAS.ASC  file,  estimate  the  following  two  models: 
(a) Compare the implied short-run and long-run elasticities for price (PMG) and income (RGNP).

(b)  Compute  the  elasticities  after  3,  5  and  7  years.  Do  these  lags  seem  plausible? 

(c)  Can  you  apply  the  Durbin- Watson  test  for  serial  correlation  to  the  dynamic  version  of  this 
model?  Perform  Durbin’s  h-test  for  the  dynamic  gasoline  model.  Also,  the  Breusch-Godfrey 
test  for  first-order  serial  correlation. 


7.  Using  the  U.S.  gasoline  data  in  Chapter  4,  problem  15,  given  in  Table  4.2  estimate  the  following 
model  with  a  six  year  lag  on  prices: 
(a) Report the unrestricted OLS estimates.

(b)  Now,  estimate  a  second  degree  polynomial  lag  for  the  same  model.  Compare  the  results  with 
part  (a)  and  explain  why  you  got  such  different  results. 

(c)  Re-estimate  part  (b)  comparing  the  six  year  lag  to  a  four  year,  and  eight  year  lag.  Which 
one  would  you  pick? 

(d)  For  the  six  year  lag  model,  does  a  third  degree  polynomial  give  a  better  fit? 

(e)  For  the  model  outlined  in  part  (b),  reestimate  with  a  far  end  point  constraint.  Now,  reestimate 
with  only  a  near  end  point  constraint.  Are  such  restrictions  justified  in  this  case? 
CHAPTER 7

The  General  Linear  Model:  The  Basics 


7.1  Introduction 

Consider  the  following  regression  equation 

$$y = X\beta + u \tag{7.1}$$

where

with n denoting the number of observations and k the number of variables in the regression,
with n > k. In this case, y is a column vector of dimension (n x 1) and X is a matrix of dimension
(n x k). Each column of X denotes a variable and each row of X denotes an observation on
these  variables.  If  y  is  log(wage)  as  in  the  empirical  example  in  Chapter  4,  see  Table  4.1  then 
the  columns  of  X  contain  a  column  of  ones  for  the  constant  (usually  the  first  column),  weeks 
worked,  years  of  full  time  experience,  years  of  education,  sex,  race,  marital  status,  etc. 


7.2  Least  Squares  Estimation 

Least squares minimizes the residual sum of squares where the residuals are given by $e = y - X\hat{\beta}$
and $\hat{\beta}$ denotes a guess on the regression parameters $\beta$. The residual sum of squares

$$RSS = \sum_{i=1}^{n} e_i^2 = e'e = (y - X\hat{\beta})'(y - X\hat{\beta}) = y'y - y'X\hat{\beta} - \hat{\beta}'X'y + \hat{\beta}'X'X\hat{\beta}$$

The last four terms are scalars as can be verified by their dimensions. It is essential that the
reader keep track of the dimensions of the matrices used. This will ensure proper multiplication,
addition, subtraction of matrices and help the reader obtain the right answers. In fact the middle
two terms are the same because the transpose of a scalar is a scalar. For a quick review of some
matrix properties, see the Appendix to this chapter. Differentiating the RSS with respect to $\hat{\beta}$
one gets

$$\partial RSS/\partial\hat{\beta} = -2X'y + 2X'X\hat{\beta} \tag{7.2}$$

where use is made of the following two rules of differentiating matrices. The first is that
$\partial a'b/\partial b = a$ and the second is

$$\partial(b'Ab)/\partial b = (A + A')b = 2Ab$$

where the last equality holds if A is a symmetric matrix. In the RSS equation $a'$ is $y'X$ and $A$ is
$X'X$. The first-order condition for minimization equates the expression in (7.2) to zero. This yields

$$X'X\hat{\beta} = X'y \tag{7.3}$$

which is known as the OLS normal equations. As long as X is of full column rank, i.e., of rank k,
then $X'X$ is nonsingular and the solution to the above equations is $\hat{\beta}_{OLS} = (X'X)^{-1}X'y$. Full
column rank means that no column of X is a perfect linear combination of the other columns.
In other words, no variable in the regression can be obtained from a linear combination of the
other variables. Otherwise, at least one of the OLS normal equations becomes redundant. This
means that we only have (k - 1) linearly independent equations to solve for k unknown $\beta$'s.
This yields no unique solution for $\hat{\beta}_{OLS}$ and we say that $X'X$ is singular. $X'X$ is the sum of squares
cross product matrix (SSCP). If X has a column of ones then $X'X$ will contain the sums, the sum
of squares, and the cross-product sum between any two variables.
Of course y could be added to this matrix as another variable which will generate $X'y$ and
$y'y$ automatically for us, i.e., the column pertaining to the variable y will generate $\sum_{i=1}^{n} y_i$,
$\sum_{i=1}^{n} x_{i1}y_i, \ldots, \sum_{i=1}^{n} x_{ik}y_i$, and $\sum_{i=1}^{n} y_i^2$. To see this, let

$$Z = [y, X] \quad \text{then} \quad Z'Z = \begin{bmatrix} y'y & y'X \\ X'y & X'X \end{bmatrix}$$

This matrix summarizes the data and we can compute any regression of one variable in Z on
any subset of the remaining variables in Z using only $Z'Z$. Denoting the least squares residuals
by $e = y - X\hat{\beta}_{OLS}$, the OLS normal equations given in (7.3) can be written as

$$X'(y - X\hat{\beta}_{OLS}) = X'e = 0 \tag{7.4}$$


Note  that  if  the  regression  includes  a  constant,  the  first  column  of  X  will  be  a  vector  of  ones 
and the first equation of (7.4) becomes $\sum_{i=1}^{n} e_i = 0$. This proves the well known result that
if  there  is  a  constant  in  the  regression,  the  OLS  residuals  sum  to  zero.  Equation  (7.4)  also 
indicates  that  the  regressor  matrix  X  is  orthogonal  to  the  residuals  vector  e.  This  will  become 
clear  when  we  define  e  in  terms  of  the  orthogonal  projection  matrix  on  X.  This  representation 
allows  another  interpretation  of  OLS  as  a  method  of  moments  estimator  which  was  considered 
in  Chapter  2.  This  follows  from  the  classical  assumptions  where  X  satisfies  E(X'u)  =  0.  The 
sample  counterpart  of  this  condition  yields  X'e/n  =  0.  These  are  the  OLS  normal  equations 
and  therefore,  yield  the  OLS  estimates  without  minimizing  the  residual  sums  of  squares. 
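
As a numerical check of (7.3) and (7.4), the sketch below solves the normal equations on simulated data and verifies that the residuals are orthogonal to every column of X. It also recovers the same estimate from the blocks of $Z'Z$. NumPy and the made-up data are assumptions of the example, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # constant plus two regressors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Solve the normal equations X'X b = X'y directly, as in (7.3).
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_ols

# The same estimate from the blocks of Z'Z with Z = [y, X].
Z = np.column_stack([y, X])
ZtZ = Z.T @ Z
beta_from_sscp = np.linalg.solve(ZtZ[1:, 1:], ZtZ[1:, 0])
rss_from_sscp = ZtZ[0, 0] - ZtZ[0, 1:] @ beta_from_sscp      # y'y - y'X b

print("X'e =", X.T @ e)                # numerically zero, equation (7.4)
print("sum of residuals =", e.sum())   # zero because X contains a constant
```

Solving the linear system with `np.linalg.solve` avoids forming $(X'X)^{-1}$ explicitly, which is usually the numerically safer choice.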

Since data in economics are not generated using experiments like the physical sciences, the
X's are stochastic and we only observe one realization of this data. Consider for example, annual
observations for GNP, money supply, unemployment rate, etc. One cannot repeat draws for this
data in the real world or fix the X's to generate new y's (unless one is performing a Monte
Carlo study). So we have to condition on the set of X's observed, see Chapter 5.

Classical Assumptions: $u \sim (0, \sigma^2 I_n)$ which means that (i) each disturbance $u_i$ has zero mean,
(ii) constant variance, and (iii) $u_i$ and $u_j$ for $i \neq j$ are not correlated. The u's are known as
spherical disturbances. Also, (iv) the conditional expectation of u given X is zero, $E(u/X) = 0$.
Note that the conditioning here is with respect to every regressor in X and for all observations
$i = 1, 2, \ldots, n$. In other words, it is conditional on all the elements of the matrix X. Using
(7.1), this implies that $E(y/X) = X\beta$ is linear in $\beta$, var$(u_i/X) = \sigma^2$ and cov$(u_i, u_j/X) = 0$.
Additionally, we assume that plim $X'X/n$ is finite and positive definite and plim $X'u/n = 0$ as
$n \to \infty$.

Given these classical assumptions, and conditioning on the X's observed, it is easy to show
that $\hat{\beta}_{OLS}$ is unbiased for $\beta$. In fact using (7.1) one can write

$$\hat{\beta}_{OLS} = \beta + (X'X)^{-1}X'u \tag{7.5}$$

Taking expectations, conditioning on the X's, and using assumptions (i) and (iv), one attains
the unbiasedness result. Furthermore, one can derive the variance-covariance matrix of $\hat{\beta}_{OLS}$
from (7.5) since

$$\text{var}(\hat{\beta}_{OLS}) = E(\hat{\beta}_{OLS} - \beta)(\hat{\beta}_{OLS} - \beta)' = E(X'X)^{-1}X'uu'X(X'X)^{-1} = \sigma^2(X'X)^{-1} \tag{7.6}$$

this uses assumption (iv) along with the fact that $E(uu') = \sigma^2 I_n$. This variance-covariance
matrix is (k x k) and gives the variances of the $\hat{\beta}_i$'s across the diagonal and the pairwise
covariances of say $\hat{\beta}_i$ and $\hat{\beta}_j$ off the diagonal. The next theorem shows that among all linear
unbiased estimators of $c'\beta$, it is $c'\hat{\beta}_{OLS}$ which has the smallest variance. This is known as the
Gauss-Markov Theorem.

Theorem 1: Consider the linear estimator $a'y$ for $c'\beta$, where both a and c are arbitrary vectors
of constants. If $a'y$ is unbiased for $c'\beta$ then var$(a'y) \geq$ var$(c'\hat{\beta}_{OLS})$.

Proof: For $a'y$ to be unbiased for $c'\beta$ it must follow from (7.1) that $E(a'y) = a'X\beta + E(a'u) =
a'X\beta = c'\beta$ which means that $a'X = c'$. Also, var$(a'y) = E(a'y - c'\beta)(a'y - c'\beta)' = E(a'uu'a) =
\sigma^2 a'a$. Comparing this variance with that of $c'\hat{\beta}_{OLS}$, one gets var$(a'y) -$ var$(c'\hat{\beta}_{OLS}) = \sigma^2 a'a -
\sigma^2 c'(X'X)^{-1}c$. But, $c' = a'X$, therefore this difference becomes $\sigma^2[a'a - a'P_Xa] = \sigma^2 a'\bar{P}_Xa$
where $P_X$ is a projection matrix on the X-plane defined as $X(X'X)^{-1}X'$ and $\bar{P}_X$ is defined as
$I_n - P_X$. In fact, $P_Xy = X\hat{\beta}_{OLS} = \hat{y}$ and $\bar{P}_Xy = y - P_Xy = y - \hat{y} = e$. So that $\hat{y}$ is the projection of
the vector y on the X-plane and e is the projection of y on the plane orthogonal to X or
perpendicular to X, see Figure 7.1. Both $P_X$ and $\bar{P}_X$ are symmetric and idempotent which means that the
above difference $\sigma^2 a'\bar{P}_Xa$ is greater or equal to zero since $\bar{P}_X$ is positive semi-definite. To see
this, define $z = \bar{P}_Xa$, then the above difference is equal to $\sigma^2 z'z \geq 0$.
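
The inequality in the proof can be illustrated numerically. The sketch below builds a second linear unbiased estimator by adding a component from the space orthogonal to X to the OLS weights, and compares $a'a$ for the two (the common factor $\sigma^2$ cancels). NumPy, the simulated X and the variable names are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
c = np.array([0.0, 1.0, 0.0])                 # picks out the second coefficient

XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T                         # P_X
P_bar = np.eye(n) - P                         # I_n - P_X

a_ols = X @ XtX_inv @ c                       # weights with a_ols'y = c'beta_OLS
a_alt = a_ols + P_bar @ rng.normal(size=n)    # another estimator with a'X = c'

# var(a'y) = sigma^2 a'a, so the variance gap is sigma^2 times the quantity below.
gap = a_alt @ a_alt - a_ols @ a_ols
print("a_alt'X =", a_alt @ X)                 # equals c', so a_alt'y is unbiased
print("variance gap / sigma^2 =", gap)
```

The gap equals `a_alt @ P_bar @ a_alt`, which is the quadratic form $a'\bar{P}_Xa$ appearing in the proof.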

The implications of the theorem are important. It means for example, that for the choice
of $c' = (1, 0, \ldots, 0)$ one can pick $\beta_1 = c'\beta$ for which the best linear unbiased estimator would
be $\hat{\beta}_{1,OLS} = c'\hat{\beta}_{OLS}$. Similarly any $\beta_j$ can be chosen by using $c' = (0, \ldots, 1, \ldots, 0)$ which has
1 in the j-th position and zero elsewhere. Again, the BLUE of $\beta_j = c'\beta$ is $\hat{\beta}_{j,OLS} = c'\hat{\beta}_{OLS}$.
Furthermore, any linear combination of these $\beta$'s such as their sum $\sum_{j=1}^{k}\beta_j$ which corresponds
to $c' = (1, 1, \ldots, 1)$ has the sum $\sum_{j=1}^{k}\hat{\beta}_{j,OLS}$ as its BLUE.

The disturbance variance $\sigma^2$ is unknown and has to be estimated. Note that $E(u'u) =
E(\text{tr}(uu')) = \text{tr}(E(uu')) = \text{tr}(\sigma^2 I_n) = n\sigma^2$, so that $u'u/n$ seems like a natural unbiased
estimator for $\sigma^2$. However, u is not observed and is estimated by the OLS residuals e. It is
therefore, natural to investigate $E(e'e)$. In what follows, we show that $s^2 = e'e/(n - k)$ is an
unbiased estimator for $\sigma^2$. To prove this, we need the fact that

$$e = y - X\hat{\beta}_{OLS} = y - X(X'X)^{-1}X'y = \bar{P}_Xy = \bar{P}_Xu \tag{7.7}$$
Figure 7.1 The Orthogonal Decomposition of y


where the last equality follows from the fact that $\bar{P}_XX = 0$. Hence,

$$E(e'e) = E(u'\bar{P}_Xu) = E(\text{tr}\{u'\bar{P}_Xu\}) = E(\text{tr}\{uu'\bar{P}_X\}) = \text{tr}(\sigma^2\bar{P}_X) = \sigma^2\text{tr}(\bar{P}_X) = \sigma^2(n - k)$$

where the second equality follows from the fact that the trace of a scalar is a scalar. The
third equality from the fact that tr(ABC) = tr(CAB). The fourth equality from the fact that
E(trace) = trace{E(.)}, and $E(uu') = \sigma^2 I_n$. The last equality from the fact that

$$\text{tr}(\bar{P}_X) = \text{tr}(I_n) - \text{tr}(P_X) = n - \text{tr}(X(X'X)^{-1}X') = n - \text{tr}(X'X(X'X)^{-1}) = n - \text{tr}(I_k) = n - k.$$

Hence, an unbiased estimator of var$(\hat{\beta}_{OLS}) = \sigma^2(X'X)^{-1}$ is given by $s^2(X'X)^{-1}$.
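
A small Monte Carlo experiment with X held fixed illustrates both the trace result and the unbiasedness of $s^2$. This is only a sketch assuming NumPy and normally distributed disturbances; the sample size, number of replications and $\sigma^2 = 1$ are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma2 = 30, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
P_bar = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

trace_P_bar = np.trace(P_bar)                 # equals n - k up to rounding

# Each row of U is one draw of u; since e = P_bar u and P_bar is symmetric,
# each row of U @ P_bar is the corresponding residual vector.
reps = 5000
U = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
E = U @ P_bar
s2 = (E ** 2).sum(axis=1) / (n - k)           # unbiased estimator of sigma^2
naive = (E ** 2).sum(axis=1) / n              # divides by n, biased downward

print("tr(P_bar) =", trace_P_bar)
print("mean of s^2 =", s2.mean(), " mean of e'e/n =", naive.mean())
```

Dividing by n instead of n - k understates $\sigma^2$ by the factor (n - k)/n on average, which is why the degrees of freedom correction matters most in small samples.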

So far we have shown that $\hat{\beta}_{OLS}$ is BLUE. It can also be shown that it is consistent for $\beta$. In
fact, taking probability limits of (7.5) as $n \to \infty$, one gets

$$\text{plim}(\hat{\beta}_{OLS}) = \text{plim}(\beta) + \text{plim}\,(X'X/n)^{-1}(X'u/n) = \beta$$

The  first  equality  uses  the  fact  that  the  plim  of  a  sum  is  the  sum  of  the  plims.  The  second 
equality  follows  from  assumption  1  and  the  fact  that  plim  of  a  product  is  the  product  of  plims. 


7.3  Partitioned  Regression  and  the  Frisch-Waugh-Lovell  Theorem 

In Chapter 4, we studied a useful property of least squares which allows us to interpret multiple
regression coefficients as simple regression coefficients. This was called the residualing
interpretation of multiple regression coefficients. In general, this property applies whenever the k
regressors given by X can be separated into two sets of variables $X_1$ and $X_2$ of dimension $(n \times k_1)$
and $(n \times k_2)$ respectively, with $X = [X_1, X_2]$ and $k = k_1 + k_2$. The regression in equation (7.1)
becomes a partitioned regression given by

$$y = X\beta + u = X_1\beta_1 + X_2\beta_2 + u \tag{7.8}$$

One may be interested in the least squares estimates of $\beta_2$ corresponding to $X_2$, but one has
to control for the presence of $X_1$ which may include seasonal dummy variables or a time trend,
see Frisch and Waugh (1933) and Lovell (1963).¹

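The OLS normal equations from (7.8) are as follows:

$$\begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix} \begin{bmatrix} \hat{\beta}_{1,OLS} \\ \hat{\beta}_{2,OLS} \end{bmatrix} = \begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix} \tag{7.9}$$

These can be solved by partitioned inversion of the matrix on the left, see the Appendix to this
chapter, or by solving two equations in two unknowns. Problem 2 asks the reader to verify that

$$\hat{\beta}_{2,OLS} = (X_2'\bar{P}_{X_1}X_2)^{-1}X_2'\bar{P}_{X_1}y \tag{7.10}$$

where $\bar{P}_{X_1} = I_n - P_{X_1}$ and $P_{X_1} = X_1(X_1'X_1)^{-1}X_1'$. $\bar{P}_{X_1}$ is the orthogonal projection matrix of
$X_1$ and $\bar{P}_{X_1}X_2$ generates the least squares residuals of each column of $X_2$ regressed on all the
variables in $X_1$. In fact, if we denote by $\tilde{X}_2 = \bar{P}_{X_1}X_2$ and $\tilde{y} = \bar{P}_{X_1}y$, then (7.10) can be written
as

$$\hat{\beta}_{2,OLS} = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'\tilde{y} \tag{7.11}$$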

using the fact that $\bar{P}_{X_1}$ is idempotent. This implies that $\hat{\beta}_{2,OLS}$ can be obtained from the
regression of $\tilde{y}$ on $\tilde{X}_2$. In words, the residuals from regressing y on $X_1$ are in turn regressed
upon the residuals from each column of $X_2$ regressed on all the variables in $X_1$. This was
illustrated in Chapter 4 with some examples. Following Davidson and MacKinnon (1993) we
denote this result more formally as the Frisch-Waugh-Lovell (FWL) Theorem. In fact, if we
premultiply (7.8) by $\bar{P}_{X_1}$ and use the fact that $\bar{P}_{X_1}X_1 = 0$, one gets

$$\bar{P}_{X_1}y = \bar{P}_{X_1}X_2\beta_2 + \bar{P}_{X_1}u \tag{7.12}$$

The FWL Theorem states that: (1) The least squares estimates of $\beta_2$ from equations (7.8)
and (7.12) are numerically identical and (2) The least squares residuals from equations (7.8)
and (7.12) are identical.

Using the fact that $\bar P_{X_1}$ is idempotent, it immediately follows that OLS on (7.12) yields $\hat\beta_{2,OLS}$ as given by equation (7.10). Alternatively, one can start from equation (7.8) and use the result that

$$y = P_X y + \bar P_X y = X\hat\beta_{OLS} + \bar P_X y = X_1\hat\beta_{1,OLS} + X_2\hat\beta_{2,OLS} + \bar P_X y \qquad (7.13)$$

where $P_X = X(X'X)^{-1}X'$ and $\bar P_X = I_n - P_X$. Premultiplying equation (7.13) by $X_2'\bar P_{X_1}$ and using the fact that $\bar P_{X_1}X_1 = 0$, one gets

$$X_2'\bar P_{X_1}y = X_2'\bar P_{X_1}X_2\hat\beta_{2,OLS} + X_2'\bar P_{X_1}\bar P_X y \qquad (7.14)$$

But $P_{X_1}P_X = P_{X_1}$. Hence, $\bar P_{X_1}\bar P_X = \bar P_X$. Using this fact along with $\bar P_X X = \bar P_X[X_1, X_2] = 0$, the last term of equation (7.14) drops out, yielding the result that $\hat\beta_{2,OLS}$ from (7.14) is identical to the expression in (7.10). Note that no partitioned inversion was used in this proof. This proves part (1) of the FWL Theorem.




Chapter  7:  The  General  Linear  Model:  The  Basics 


Also, premultiplying equation (7.13) by $\bar P_{X_1}$ and using the fact that $\bar P_{X_1}\bar P_X = \bar P_X$, one gets

$$\bar P_{X_1}y = \bar P_{X_1}X_2\hat\beta_{2,OLS} + \bar P_X y \qquad (7.15)$$

Now $\hat\beta_{2,OLS}$ was shown to be numerically identical to the least squares estimate obtained from equation (7.12). Hence, the first term on the right hand side of equation (7.15) must be the fitted values from equation (7.12). Since the dependent variables are the same in equations (7.15) and (7.12), $\bar P_X y$ in equation (7.15) must be the least squares residuals from regression (7.12). But $\bar P_X y$ is the least squares residuals from regression (7.8). Hence, the least squares residuals from regressions (7.8) and (7.12) are numerically identical. This proves part (2) of the FWL Theorem.

Several applications of the FWL Theorem will be given in this book. Problem 2 shows that if $X_1$ is the vector of ones indicating the presence of a constant in the regression, then regression (7.15) is equivalent to running $(y_i - \bar y)$ on the set of variables in $X_2$ expressed as deviations from their respective sample means. Problem 3 shows that the FWL Theorem can be used to prove that including a dummy variable for one of the observations in the regression is equivalent to omitting that observation from the regression.
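
Both parts of the FWL Theorem can be checked numerically. The sketch below is only an illustration on simulated data: the sample size, the coefficient values, and the helper name `ols` are assumptions made for the example, and `M1` plays the role of $\bar P_{X_1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# X1: a constant and a time trend (the variables being controlled for)
X1 = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
# X2: two regressors of interest
X2 = rng.normal(size=(n, 2))
y = X1 @ np.array([1.0, 0.05]) + X2 @ np.array([2.0, -1.0]) + rng.normal(size=n)

def ols(X, y):
    """Return the OLS coefficient vector and the residual vector."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return b, y - X @ b

# Full regression of y on [X1, X2], as in (7.8)
b_full, e_full = ols(np.column_stack([X1, X2]), y)

# M1 = I - X1 (X1'X1)^{-1} X1' produces residuals from a regression on X1
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)

# Regression of the residualized y on the residualized X2, as in (7.12)
b_fwl, e_fwl = ols(M1 @ X2, M1 @ y)

coef_match = np.allclose(b_full[2:], b_fwl)   # part (1) of the theorem
resid_match = np.allclose(e_full, e_fwl)      # part (2) of the theorem
print(coef_match, resid_match)
```

Forming the full $n \times n$ matrix `M1` is convenient for a small example; for large $n$ one would usually residualize each column by a regression on $X_1$ instead.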


7.4  Maximum  Likelihood  Estimation 

In  Chapter  2,  we  introduced  the  method  of  maximum  likelihood  estimation  which  is  based  on 
specifying  the  distribution  we  are  sampling  from  and  writing  the  joint  density  of  our  sample. 
This  joint  density  is  then  referred  to  as  the  likelihood  function  because  it  gives  for  a  given 
set  of  parameters  specifying  the  distribution,  the  probability  of  obtaining  the  observed  sample. 
See  Chapter  2  for  several  examples.  For  the  regression  equation,  specifying  the  distribution  of 
the  disturbances  in  turn  specifies  the  likelihood  function.  These  disturbances  could  be  Poisson, 
Exponential,  Normal,  etc.  Once  this  distribution  is  chosen,  the  likelihood  function  is  maximized 
and  the  MLE  of  the  regression  parameters  are  obtained.  Maximum  likelihood  estimators  are 
desirable because they are (1) consistent under fairly general conditions,² (2) asymptotically
normal, (3) asymptotically efficient and (4) invariant to reparameterizations of the model.³ Some
of  the  undesirable  properties  of  MLE  are  that  (1)  it  requires  explicit  distributional  assumptions 
on  the  disturbances,  and  (2)  their  finite  sample  properties  can  be  quite  different  from  their 
asymptotic  properties.  For  example,  MLE  can  be  biased  even  though  they  are  consistent,  and 
their  covariance  estimates  can  be  misleading  for  small  samples.  In  this  section,  we  derive  the 
MLE  under  normality  of  the  disturbances. 

The Normality Assumption: $u \sim N(0, \sigma^2 I_n)$. This additional assumption allows us to derive distributions of estimators and other random variables. This is important for constructing confidence intervals and tests of hypotheses. In fact, using (7.5) one can easily see that $\hat\beta_{OLS}$ is a linear combination of the $u$'s. But a linear combination of normal random variables is itself a normal random variable. Hence, $\hat\beta_{OLS}$ is $N(\beta, \sigma^2(X'X)^{-1})$. Similarly, $y$ is $N(X\beta, \sigma^2 I_n)$ and $e$ is $N(0, \sigma^2\bar P_X)$. Moreover, we can write the joint probability density function of the $u$'s as $f(u_1, u_2, \ldots, u_n; \sigma^2) = (1/2\pi\sigma^2)^{n/2}\exp(-u'u/2\sigma^2)$. To get the likelihood function we make the transformation $u = y - X\beta$ and note that the Jacobian of the transformation is one. Hence

$$f(y_1, y_2, \ldots, y_n; \beta, \sigma^2) = (1/2\pi\sigma^2)^{n/2}\exp\{-(y - X\beta)'(y - X\beta)/2\sigma^2\} \qquad (7.16)$$

Taking the log of this likelihood, we get

$$\log L(\beta, \sigma^2) = -(n/2)\log(2\pi\sigma^2) - (y - X\beta)'(y - X\beta)/2\sigma^2 \qquad (7.17)$$

Maximizing this likelihood with respect to $\beta$ and $\sigma^2$ one gets the maximum likelihood estimators (MLE). Let $\theta = \sigma^2$ and $Q = (y - X\beta)'(y - X\beta)$, then

$$\frac{\partial \log L(\beta, \theta)}{\partial \beta} = \frac{2X'y - 2X'X\beta}{2\theta}$$

$$\frac{\partial \log L(\beta, \theta)}{\partial \theta} = \frac{Q}{2\theta^2} - \frac{n}{2\theta}$$

Setting these first-order conditions equal to zero, one gets

$$\hat\beta_{MLE} = \hat\beta_{OLS} \quad\text{and}\quad \hat\theta = \hat\sigma^2_{MLE} = Q/n = RSS/n = e'e/n.$$

Intuitively, only the second term in the log likelihood contains $\beta$ and that term (without the negative sign) has already been minimized with respect to $\beta$ in (7.2), giving us the OLS estimator. Note that $\hat\sigma^2_{MLE}$ differs from $s^2$ only in the degrees of freedom. It is clear that $\hat\beta_{MLE}$ is unbiased for $\beta$ while $\hat\sigma^2_{MLE}$ is not unbiased for $\sigma^2$. Substituting these MLEs into (7.17) one gets the maximum value of $\log L$, which is

$$\log L(\hat\beta_{MLE}, \hat\sigma^2_{MLE}) = -(n/2)\log(2\pi\hat\sigma^2_{MLE}) - e'e/2\hat\sigma^2_{MLE}$$

$$= -(n/2)\log(2\pi) - (n/2)\log(e'e/n) - n/2$$

$$= \text{constant} - (n/2)\log(e'e).$$
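
The closed form for the maximized log-likelihood can be verified on simulated data. The following is a minimal sketch, not part of the original derivation: the design, the coefficient values, and the function name `loglik` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

# OLS, which coincides with the MLE of beta under normality
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
rss = e @ e

sigma2_mle = rss / n        # MLE of sigma^2 (biased)
s2 = rss / (n - k)          # usual unbiased estimator

def loglik(beta, sigma2):
    """Normal log-likelihood of equation (7.17)."""
    r = y - X @ beta
    return -(n / 2) * np.log(2 * np.pi * sigma2) - (r @ r) / (2 * sigma2)

# Value at the MLE versus the closed form given in the text
ll_max = loglik(beta_hat, sigma2_mle)
ll_closed = -(n / 2) * np.log(2 * np.pi) - (n / 2) * np.log(rss / n) - n / 2
print(np.isclose(ll_max, ll_closed))
```

Evaluating `loglik` at any other parameter value, for example at `(beta_hat, s2)`, gives a smaller number, which is one quick way to see that $s^2$ is not the maximizer.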

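In order to get the Cramer-Rao lower bound for the unbiased estimators of $\beta$ and $\sigma^2$ one first computes the information matrix

$$I(\beta, \sigma^2) = -E\begin{bmatrix} \partial^2\log L/\partial\beta\partial\beta' & \partial^2\log L/\partial\beta\partial\sigma^2 \\ \partial^2\log L/\partial\sigma^2\partial\beta' & \partial^2\log L/\partial\sigma^2\partial\sigma^2 \end{bmatrix} \qquad (7.18)$$

Recall that $\theta = \sigma^2$ and $Q = (y - X\beta)'(y - X\beta)$. It is easy to show (see problem 4) that

$$\frac{\partial^2\log L(\beta, \theta)}{\partial\beta\,\partial\theta} = \frac{1}{2\theta^2}\frac{\partial Q}{\partial\beta} \quad\text{and}\quad \frac{\partial^2\log L(\beta, \theta)}{\partial\theta\,\partial\beta} = \frac{-X'(y - X\beta)}{\theta^2}$$

Therefore,

$$E\left(\frac{\partial^2\log L(\beta, \theta)}{\partial\theta\,\partial\beta}\right) = \frac{-E(X'u)}{\theta^2} = 0$$

Also

$$\frac{\partial^2\log L(\beta, \theta)}{\partial\beta\,\partial\beta'} = \frac{-X'X}{\theta} \quad\text{and}\quad \frac{\partial^2\log L(\beta, \theta)}{\partial\theta^2} = \frac{-4Q}{4\theta^3} + \frac{2n}{4\theta^2} = \frac{-Q}{\theta^3} + \frac{n}{2\theta^2}$$

so that

$$E\left(\frac{\partial^2\log L(\beta, \theta)}{\partial\theta^2}\right) = \frac{-n\theta}{\theta^3} + \frac{n}{2\theta^2} = \frac{-2n + n}{2\theta^2} = \frac{-n}{2\theta^2}$$

using the fact that $E(Q) = n\sigma^2 = n\theta$. Hence,

$$I(\beta, \sigma^2) = \begin{bmatrix} X'X/\sigma^2 & 0 \\ 0 & n/2\sigma^4 \end{bmatrix} \qquad (7.19)$$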


The information matrix is block-diagonal between $\beta$ and $\sigma^2$. This is an important property for regression models with normal disturbances. It implies that the Cramer-Rao lower bound is

$$I^{-1}(\beta, \sigma^2) = \begin{bmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0 & 2\sigma^4/n \end{bmatrix} \qquad (7.20)$$

Note that $\hat\beta_{MLE} = \hat\beta_{OLS}$ attains the Cramer-Rao lower bound. Under normality, $\hat\beta_{OLS}$ is MVU (minimum variance unbiased). This is best among all unbiased estimators, not only linear unbiased estimators. By assuming more (in this case normality) we get more (MVU rather than BLUE).⁴

Problem 5 derives the variance of $s^2$ under normality of the disturbances. This is found to be $2\sigma^4/(n - k)$. This means that $s^2$ does not attain the Cramer-Rao lower bound. However, following the theory of complete sufficient statistics one can show that both $\hat\beta_{OLS}$ and $s^2$ are MVU for their respective parameters and therefore both are small sample efficient. Note also that $\hat\sigma^2_{MLE}$ is biased, therefore it is not meaningful to compare its variance to the Cramer-Rao lower bound. There is a trade-off between bias and variance in estimating $\sigma^2$. Problem 6 looks at all estimators of $\sigma^2$ of the type $e'e/r$ and derives $r$ such that the mean squared error (MSE) is minimized. The choice of $r$ turns out to be $(n - k + 2)$.
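
The bias-variance trade-off for estimators of the form $e'e/r$ can be illustrated with a short grid search. The sketch assumes normal disturbances, so that $e'e/\sigma^2$ is chi-squared with $(n - k)$ degrees of freedom (mean $n - k$, variance $2(n - k)$); the values of `n`, `k` and `sigma2` and the function name `mse` are illustrative assumptions.

```python
import numpy as np

n, k, sigma2 = 25, 4, 1.0
m = n - k  # degrees of freedom of e'e / sigma^2

def mse(r):
    """MSE of e'e/r as an estimator of sigma^2 under normality."""
    bias = sigma2 * (m / r - 1.0)          # E(e'e/r) - sigma^2
    var = 2.0 * sigma2 ** 2 * m / r ** 2   # var(e'e)/r^2 with var(e'e) = 2 m sigma^4
    return var + bias ** 2

# Search over integer divisors r = 1, ..., 2n
r_grid = np.arange(1, 2 * n + 1)
r_best = int(r_grid[np.argmin(mse(r_grid))])
print(r_best, n - k + 2)
```

The divisor $r = n - k$ gives zero bias and $r = n$ gives the MLE; the grid search selects a value between them, which trades a small bias for a lower variance.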

We found the distribution of $\hat\beta_{OLS}$; now we derive the distribution of $s^2$. In order to do that we need a result from matrix algebra, which is stated without proof, see Graybill (1961).

Lemma 1: For every symmetric idempotent matrix $A$ of rank $r$, there exists an orthogonal matrix $P$ such that $P'AP = J_r$ where $J_r$ is a diagonal matrix with the first $r$ elements equal to one and the rest equal to zero.

We use this lemma to show that $RSS/\sigma^2$ is a chi-squared with $(n - k)$ degrees of freedom. To see this note that $e'e/\sigma^2 = u'\bar P_X u/\sigma^2$ and that $\bar P_X$ is symmetric and idempotent of rank $(n - k)$. Using the lemma there exists a matrix $P$ such that $P'\bar P_X P = J_{n-k}$ is a diagonal matrix with the first $(n - k)$ elements on the diagonal equal to 1 and the last $k$ elements equal to zero. Now make the change of variable $v = P'u$. This makes $v \sim N(0, \sigma^2 I_n)$ since the $v$'s are linear combinations of the $u$'s and $P'P = I_n$. Replacing $u$ by $v$ in $RSS/\sigma^2$ we get

$$v'P'\bar P_X Pv/\sigma^2 = v'J_{n-k}v/\sigma^2 = \sum_{i=1}^{n-k} v_i^2/\sigma^2$$

where the last sum is only over $i = 1, 2, \ldots, n - k$. But the $v$'s are independent identically distributed $N(0, \sigma^2)$, hence $v_i^2/\sigma^2$ is the square of a standardized $N(0, 1)$ random variable, which is distributed as a $\chi^2_1$. Moreover, the sum of independent $\chi^2$ random variables is a $\chi^2$ random variable with degrees of freedom equal to the sum of the respective degrees of freedom. Hence, $RSS/\sigma^2$ is distributed as $\chi^2_{n-k}$.

The beauty of the above result is that it applies to all quadratic forms $u'Au$ where $A$ is symmetric and idempotent. We will use this result again in the test of hypotheses section.
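
The mechanics of Lemma 1 can be traced numerically for $A = \bar P_X$. This sketch uses a symmetric eigendecomposition to build the orthogonal matrix $P$; the simulated design and the variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# M = I - X (X'X)^{-1} X' is the residual-maker matrix (P-bar_X in the text)
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# Symmetric eigendecomposition M = P diag(w) P' with orthogonal P
w, P = np.linalg.eigh(M)
# eigh returns eigenvalues in ascending order; put the unit eigenvalues first
order = np.argsort(-w)
w, P = w[order], P[:, order]

J = P.T @ M @ P          # should equal diag(1, ..., 1, 0, ..., 0)

# Change of variable v = P'u turns u'Mu into a sum of (n - k) squares
u = rng.normal(size=n)
v = P.T @ u
quad_u = u @ M @ u
quad_v = np.sum(v[: n - k] ** 2)
print(np.isclose(quad_u, quad_v))
```

The trace of `M` equals its rank $n - k$, which is why exactly $n - k$ squared terms survive in the quadratic form.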
7.5 Prediction


Let us now predict $T_0$ periods ahead. Those new observations are assumed to satisfy (7.1). In other words

$$y_0 = X_0\beta + u_0 \qquad (7.21)$$

What is the Best Linear Unbiased Predictor (BLUP) of $E(y_0)$? From (7.21), $E(y_0) = X_0\beta$, which is a linear combination of the $\beta$'s. Using the Gauss-Markov result, $\hat y_0 = X_0\hat\beta_{OLS}$ is BLUE for $X_0\beta$ and the variance of this predictor of $E(y_0)$ is $X_0\,\mathrm{var}(\hat\beta_{OLS})X_0' = \sigma^2 X_0(X'X)^{-1}X_0'$. But what if we are interested in the predictor for $y_0$? The best predictor of $u_0$ is zero, so the predictor for $y_0$ is still $\hat y_0$ but its MSE is

$$E(\hat y_0 - y_0)(\hat y_0 - y_0)' = E\{X_0(\hat\beta_{OLS} - \beta) - u_0\}\{X_0(\hat\beta_{OLS} - \beta) - u_0\}'$$

$$= X_0\,\mathrm{var}(\hat\beta_{OLS})X_0' + \sigma^2 I_{T_0} - 2\,\mathrm{cov}\{X_0(\hat\beta_{OLS} - \beta), u_0\} \qquad (7.22)$$

$$= \sigma^2 X_0(X'X)^{-1}X_0' + \sigma^2 I_{T_0}$$

The last equality follows from the fact that $(\hat\beta_{OLS} - \beta) = (X'X)^{-1}X'u$ and $u_0$ have zero covariance. The latter holds because $u_0$ and $u$ have zero covariance. Intuitively, this says that the future $T_0$ disturbances are not correlated with the current sample disturbances.

Therefore, the predictor of the average consumption of a $20,000 income household is the same as the predictor of consumption of a specific household whose income is $20,000. The difference is not in the predictor itself but in the MSE attached to it, the latter MSE being larger.

Salkever (1976) suggested a simple way to compute these forecasts and their standard errors. The basic idea is to augment the usual regression in (7.1) with a matrix of observation-specific dummies, i.e., a dummy variable for each period where we want to forecast:

$$\begin{bmatrix} y \\ y_0 \end{bmatrix} = \begin{bmatrix} X & 0 \\ X_0 & I_{T_0} \end{bmatrix}\begin{bmatrix} \beta \\ \gamma \end{bmatrix} + \begin{bmatrix} u \\ u_0 \end{bmatrix} \qquad (7.23)$$

or

$$y^* = X^*\delta + u^* \qquad (7.24)$$

where $\delta' = (\beta', \gamma')$. $X^*$ has in its second part a matrix of dummy variables, one for each of the $T_0$ periods for which we are forecasting. Since these $T_0$ observations do not serve in the estimation, problem 7 asks the reader to verify that OLS on (7.23) yields $\hat\delta' = (\hat\beta', \hat\gamma')$ where $\hat\beta = (X'X)^{-1}X'y$, $\hat\gamma = y_0 - \hat y_0$, and $\hat y_0 = X_0\hat\beta$. In other words, OLS on (7.23) yields the OLS estimate of $\beta$ without the $T_0$ observations, and the coefficients of the $T_0$ dummies are the forecast errors. This also means that the first $n$ residuals are the usual OLS residuals $e = y - X\hat\beta$ based on the first $n$ observations, whereas the next $T_0$ residuals are all zero.


Therefore, $s^{*2} = s^2 = e'e/(n - k)$, and the variance-covariance matrix of $\hat\delta$ is given by

$$s^2(X^{*\prime}X^*)^{-1} = s^2\begin{bmatrix} (X'X)^{-1} & \cdot \\ \cdot & I_{T_0} + X_0(X'X)^{-1}X_0' \end{bmatrix} \qquad (7.25)$$

and the off-diagonal elements are of no interest. This means that the regression package gives the estimated variance of $\hat\beta$ and the estimated variance of the forecast error in one stroke. Note that if the forecasts rather than the forecast errors are needed, one can replace $y_0$ by zero, and $I_{T_0}$ by $-I_{T_0}$ in (7.23). The resulting estimate of $\gamma$ will be $\hat y_0 = X_0\hat\beta$, as required. The variance of this forecast will be the same as that given in (7.25), see problem 7.
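
Salkever's device is easy to reproduce with any least squares routine. The sketch below compares the ordinary route (estimate on the first $n$ observations, then forecast) with the augmented regression (7.23); the simulated data and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, T0 = 30, 3, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
X0 = np.column_stack([np.ones(T0), rng.normal(size=(T0, k - 1))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(size=n)
y0 = X0 @ beta + rng.normal(size=T0)

# Ordinary route: OLS on the first n observations, then forecast
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - k)
forecast_err = y0 - X0 @ b
var_forecast_err = s2 * (np.eye(T0) + X0 @ XtX_inv @ X0.T)

# Salkever's augmented regression (7.23): one dummy per forecast period
X_star = np.block([[X, np.zeros((n, T0))], [X0, np.eye(T0)]])
y_star = np.concatenate([y, y0])
d = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)
e_star = y_star - X_star @ d
# (n + T0) observations and (k + T0) coefficients leave n - k degrees of freedom
s2_star = e_star @ e_star / (n + T0 - (k + T0))
V_star = s2_star * np.linalg.inv(X_star.T @ X_star)

print(np.allclose(d[k:], forecast_err))
print(np.allclose(V_star[k:, k:], var_forecast_err))
```

The last `T0` entries of `d` are the forecast errors and the lower-right block of `V_star` is their estimated variance, so a single regression output delivers both.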




7.6  Confidence  Intervals  and  Test  of  Hypotheses 

We start by constructing a confidence interval for any linear combination of $\beta$, say $c'\beta$. We know that $c'\hat\beta_{OLS} \sim N(c'\beta, \sigma^2 c'(X'X)^{-1}c)$ and it is a scalar. Hence,

$$z_{obs} = (c'\hat\beta_{OLS} - c'\beta)/\sigma(c'(X'X)^{-1}c)^{1/2} \qquad (7.26)$$

is a standardized $N(0, 1)$ random variable. Replacing $\sigma$ by $s$ is equivalent to dividing $z_{obs}$ by the square root of a $\chi^2$ random variable divided by its degrees of freedom. The latter random variable is $(n - k)s^2/\sigma^2 = RSS/\sigma^2$, which was shown to be a $\chi^2_{n-k}$. Problem 8 shows that $z_{obs}$ and $RSS/\sigma^2$ are independent. This means that

$$t_{obs} = (c'\hat\beta_{OLS} - c'\beta)/s(c'(X'X)^{-1}c)^{1/2} \qquad (7.27)$$

is a $N(0, 1)$ random variable divided by the square root of an independent $\chi^2_{n-k}/(n - k)$. This is a $t$-statistic with $(n - k)$ degrees of freedom. Hence, a $100(1 - \alpha)\%$ confidence interval for $c'\beta$ is

$$c'\hat\beta_{OLS} \pm t_{\alpha/2}\, s\,(c'(X'X)^{-1}c)^{1/2} \qquad (7.28)$$

Example: Let us say we are predicting one year ahead so that $T_0 = 1$ and $x_0$ is a $(1 \times k)$ vector of next year's observations on the exogenous variables. The $100(1 - \alpha)\%$ confidence interval for next year's forecast of $y_0$ will be $\hat y_0 \pm t_{\alpha/2}\, s\,(1 + x_0'(X'X)^{-1}x_0)^{1/2}$. Similarly, (7.28) allows us to construct confidence intervals or test any single hypothesis on any single $\beta_j$ (again by picking $c$ to have 1 in its $j$-th position and zero elsewhere). In this case we get the usual $t$-statistic reported in any regression package. More importantly, this allows us to test any hypothesis concerning any linear combination of the $\beta$'s, e.g., testing that the sum of coefficients of input variables in a Cobb-Douglas production function is equal to one. This is known as a test for constant returns to scale, see Chapter 4.


7.7  Joint  Confidence  Intervals  and  Test  of  Hypotheses 

We have learned how to test any single hypothesis involving any linear combination of the $\beta$'s. But what if we are interested in testing two or three or more hypotheses involving linear combinations of the $\beta$'s? For example, testing that $\beta_2 = \beta_4 = 0$, i.e., that variables $X_2$ and $X_4$ are not significant in the model. This can be written as $c_2'\beta = c_4'\beta = 0$ where $c_j'$ is a row vector of zeros with a one in the $j$-th position. In order to test these two hypotheses simultaneously, we rearrange these restrictions on the $\beta$'s in matrix form $R\beta = 0$ where $R' = [c_2, c_4]$. In a similar fashion, we can rearrange $g$ restrictions on the $\beta$'s into this matrix $R$, which will now be of dimension $(g \times k)$. Also, these restrictions need not be of the form $R\beta = 0$ and can be of the more general form $R\beta = r$ where $r$ is a $(g \times 1)$ vector of constants. For example, $\beta_1 + \beta_2 = 1$ and $3\beta_3 + 2\beta_4 = 5$ are two such restrictions. Since $R\beta$ is a collection of linear combinations of the $\beta$'s, the BLUE of these is $R\hat\beta_{OLS}$ and the latter is distributed $N(R\beta, \sigma^2 R(X'X)^{-1}R')$. Standardization of the form encountered with the scalar $c'\beta$ gives us the following:

$$(R\hat\beta_{OLS} - R\beta)'[R(X'X)^{-1}R']^{-1}(R\hat\beta_{OLS} - R\beta)/\sigma^2 \qquad (7.29)$$

Rather than divide by the variance we multiply by its inverse, and since we divided by the variance rather than the standard deviation we square the numerator, which means in vector form premultiplying by its transpose. Problem 9 replaces the matrix $R$ by the vector $c'$ and shows that (7.29) reduces to the square of the $z$-statistic observed in (7.26). This also proves that the resulting statistic is distributed as $\chi^2_1$. But what is the distribution of (7.29)? The trick is to write it in terms of the original disturbances, i.e.,

$$u'X(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}X'u/\sigma^2 \qquad (7.30)$$

where $(R\hat\beta_{OLS} - R\beta)$ is replaced by $R(X'X)^{-1}X'u$. Note that (7.30) is quadratic in the disturbances $u$ of the form $u'Au/\sigma^2$. Problem 10 shows that $A$ is symmetric and idempotent and of rank $g$. Applying the same proof as given below Lemma 1 we get the result that (7.30) is distributed as $\chi^2_g$. Again $\sigma^2$ is unobserved, so we divide by $(n - k)s^2/\sigma^2$, which is $\chi^2_{n-k}$. This becomes a ratio of two $\chi^2$ random variables. If we divide the numerator and denominator $\chi^2$'s by their respective degrees of freedom and prove that they are independent (see problem 11), the resulting statistic

$$(R\hat\beta_{OLS} - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta_{OLS} - r)/gs^2 \qquad (7.31)$$

is distributed under the null $R\beta = r$ as an $F(g, n - k)$.
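
As a hedged illustration of (7.31), the sketch below computes the $F$-statistic for the joint hypothesis $\beta_2 = \beta_4 = 0$ and checks that, for a single restriction, it equals the square of the $t$-statistic from (7.27). The data are simulated and the helper name `f_stat` is an assumption; positions in the code are zero-based, so $\beta_2$ and $\beta_4$ sit at indices 1 and 3.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 60, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.0, 0.5, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - k)

def f_stat(R, r):
    """F-statistic of equation (7.31) for H0: R beta = r."""
    R = np.atleast_2d(R)
    g = R.shape[0]                      # number of restrictions
    d = R @ b - r                       # departure of OLS from the restrictions
    return float(d @ np.linalg.solve(R @ XtX_inv @ R.T, d) / (g * s2))

# Joint hypothesis beta_2 = beta_4 = 0
R = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
F_joint = f_stat(R, np.zeros(2))

# Single restriction beta_2 = 0: F equals the squared t-statistic
c = np.array([0.0, 1.0, 0.0, 0.0])
t_obs = (c @ b) / np.sqrt(s2 * (c @ XtX_inv @ c))
F_single = f_stat(c, np.zeros(1))
print(np.isclose(F_single, t_obs ** 2))
```

Comparing `F_joint` with a critical value of the $F(g, n - k)$ distribution would complete the test; that lookup is left out here to keep the example dependent on NumPy only.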

7.8  Restricted  MLE  and  Restricted  Least  Squares 

Maximizing the likelihood function given in (7.16) subject to $R\beta = r$ is equivalent to minimizing the residual sum of squares subject to $R\beta = r$. Forming the Lagrangian function

$$\Psi(\beta, \mu) = (y - X\beta)'(y - X\beta) + 2\mu'(R\beta - r) \qquad (7.32)$$

and differentiating with respect to $\beta$ and $\mu$ one gets

$$\partial\Psi(\beta, \mu)/\partial\beta = -2X'y + 2X'X\beta + 2R'\mu = 0 \qquad (7.33)$$

$$\partial\Psi(\beta, \mu)/\partial\mu = 2(R\beta - r) = 0 \qquad (7.34)$$

Solving for $\mu$, we premultiply (7.33) by $R(X'X)^{-1}$ and use (7.34)

$$\hat\mu = [R(X'X)^{-1}R']^{-1}(R\hat\beta_{OLS} - r) \qquad (7.35)$$

Substituting (7.35) in (7.33) we get

$$\hat\beta_{RLS} = \hat\beta_{OLS} - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\hat\beta_{OLS} - r) \qquad (7.36)$$

The restricted least squares estimator of $\beta$ differs from the unrestricted OLS estimator by the second term in (7.36), with the term in parentheses showing the extent to which the unrestricted OLS estimator satisfies the constraint. Problem 12 shows that $\hat\beta_{RLS}$ is biased unless the restriction $R\beta = r$ is satisfied. However, its variance is always less than that of $\hat\beta_{OLS}$. This brings in the trade-off between bias and variance and the MSE criterion, which was discussed in Chapter 2.

The Lagrange Multiplier estimator $\hat\mu$ is distributed $N(0, \sigma^2[R(X'X)^{-1}R']^{-1})$ under the null hypothesis. Therefore, to test $\mu = 0$, we use

$$\hat\mu'[R(X'X)^{-1}R']\hat\mu/\sigma^2 = (R\hat\beta_{OLS} - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta_{OLS} - r)/\sigma^2 \qquad (7.37)$$

Since $\mu$ measures the cost of imposing the restriction $R\beta = r$, it is no surprise that the right hand side of (7.37) was already encountered in (7.29) and is distributed as $\chi^2_g$.
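
The restricted least squares formulas can be checked on simulated data. The sketch below imposes the two example restrictions mentioned in Section 7.7, $\beta_1 + \beta_2 = 1$ and $3\beta_3 + 2\beta_4 = 5$; the data-generating values and the variable names (`mu_hat` stands for the estimated Lagrange multiplier) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.4, 0.6, 1.0, 1.0]) + rng.normal(size=n)

# Restrictions R beta = r: beta_1 + beta_2 = 1 and 3 beta_3 + 2 beta_4 = 5
R = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 2.0]])
r = np.array([1.0, 5.0])

XtX_inv = np.linalg.inv(X.T @ X)
b_ols = XtX_inv @ X.T @ y
A = R @ XtX_inv @ R.T

mu_hat = np.linalg.solve(A, R @ b_ols - r)    # equation (7.35)
b_rls = b_ols - XtX_inv @ R.T @ mu_hat        # equation (7.36)

# First-order condition (7.33) evaluated at the restricted solution
foc = -2 * X.T @ y + 2 * X.T @ X @ b_rls + 2 * R.T @ mu_hat

# Both sides of (7.37), before dividing by sigma^2
lhs_737 = mu_hat @ A @ mu_hat
rhs_737 = (R @ b_ols - r) @ np.linalg.solve(A, R @ b_ols - r)
print(np.allclose(R @ b_rls, r), np.isclose(lhs_737, rhs_737))
```

The restricted estimate satisfies the constraints exactly, whatever the data, because (7.36) subtracts precisely the part of $\hat\beta_{OLS}$ that violates them.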
7.9 Likelihood Ratio, Wald and Lagrange Multiplier Tests

Before we go into the derivations of these three classical tests for the null hypothesis $H_0$: $R\beta = r$, it is important for the reader to review the intuitive graphical explanation of these tests given in Chapter 2.

The Likelihood Ratio test of $H_0$: $R\beta = r$ is based upon the ratio $\lambda = \max\ell_r/\max\ell_u$, where $\max\ell_u$ and $\max\ell_r$ are the maximum values of the unrestricted and restricted likelihoods, respectively. Let us assume for simplicity that $\sigma^2$ is known, then

$$\max\ell_u = (1/2\pi\sigma^2)^{n/2}\exp\{-(y - X\hat\beta_{MLE})'(y - X\hat\beta_{MLE})/2\sigma^2\}$$

where $\hat\beta_{MLE} = \hat\beta_{OLS}$. Denoting the unrestricted residual sum of squares by URSS, we have

$$\max\ell_u = (1/2\pi\sigma^2)^{n/2}\exp\{-URSS/2\sigma^2\}$$

Similarly, $\max\ell_r$ is given by

$$\max\ell_r = (1/2\pi\sigma^2)^{n/2}\exp\{-(y - X\hat\beta_{RMLE})'(y - X\hat\beta_{RMLE})/2\sigma^2\}$$

where $\hat\beta_{RMLE} = \hat\beta_{RLS}$. Denoting the restricted residual sum of squares by RRSS, we have

$$\max\ell_r = (1/2\pi\sigma^2)^{n/2}\exp\{-RRSS/2\sigma^2\}$$

Therefore, $-2\log\lambda = (RRSS - URSS)/\sigma^2$. Let us find the relationship between these residual sums of squares.

$$e_r = y - X\hat\beta_{RLS} = y - X\hat\beta_{OLS} - X(\hat\beta_{RLS} - \hat\beta_{OLS}) = e - X(\hat\beta_{RLS} - \hat\beta_{OLS}) \qquad (7.38)$$

$$e_r'e_r = e'e + (\hat\beta_{RLS} - \hat\beta_{OLS})'X'X(\hat\beta_{RLS} - \hat\beta_{OLS})$$

where $e_r$ denotes the restricted residuals and $e_r'e_r$ the RRSS. The cross-product terms drop out because $X'e = 0$. Substituting the value of $(\hat\beta_{RLS} - \hat\beta_{OLS})$ from (7.36) into (7.38), we get:

$$RRSS - URSS = (R\hat\beta_{OLS} - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta_{OLS} - r) \qquad (7.39)$$

It is now clear that $-2\log\lambda$ is the right hand side of (7.39) divided by $\sigma^2$. In fact, this Likelihood Ratio (LR) statistic is the same as that given in (7.37) and (7.29). Under the null hypothesis $R\beta = r$, this was shown to be a $\chi^2_g$.

The Wald test of $R\beta = r$ is based upon the unrestricted estimator and the extent to which it satisfies the restriction. More formally, if $r(\beta) = 0$ denotes the vector of $g$ restrictions on $\beta$ and $R(\hat\beta_{MLE})$ denotes the $(g \times k)$ matrix of partial derivatives $\partial r(\beta)/\partial\beta'$ evaluated at $\hat\beta_{MLE}$, then the Wald statistic is given by

$$W = r(\hat\beta_{MLE})'[R(\hat\beta_{MLE})I(\hat\beta_{MLE})^{-1}R(\hat\beta_{MLE})']^{-1}r(\hat\beta_{MLE}) \qquad (7.40)$$

where $I(\beta) = -E(\partial^2\log L/\partial\beta\partial\beta')$. In this case, $r(\beta) = R\beta - r$, $R(\hat\beta_{MLE}) = R$ and $I(\hat\beta_{MLE}) = (X'X)/\sigma^2$ as seen in (7.19). Therefore,

$$W = (R\hat\beta_{MLE} - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta_{MLE} - r)/\sigma^2 \qquad (7.41)$$

which is the same as the LR statistic.⁵
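
The equality of the LR and Wald statistics when $\sigma^2$ is known can be confirmed numerically. In this sketch the value `sigma2 = 1.0`, the single restriction $\beta_2 + \beta_3 = 1$, and the simulated data are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 50, 3
sigma2 = 1.0  # treated as known, as assumed in the text
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

R = np.array([[0.0, 1.0, 1.0]])   # H0: beta_2 + beta_3 = 1
r = np.array([1.0])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                               # unrestricted estimator
A = R @ XtX_inv @ R.T
d = R @ b - r
b_r = b - XtX_inv @ R.T @ np.linalg.solve(A, d)     # restricted estimator (7.36)

urss = np.sum((y - X @ b) ** 2)
rrss = np.sum((y - X @ b_r) ** 2)

def loglik(beta):
    """Normal log-likelihood with sigma^2 known."""
    res = y - X @ beta
    return -(n / 2) * np.log(2 * np.pi * sigma2) - res @ res / (2 * sigma2)

LR = -2 * (loglik(b_r) - loglik(b))                 # -2 log(lambda)
W = float(d @ np.linalg.solve(A, d)) / sigma2       # equation (7.41)
print(np.isclose(LR, W), np.isclose(LR, (rrss - urss) / sigma2))
```

Both checks rest on identity (7.39): the increase in the residual sum of squares caused by the restriction equals the quadratic form in $R\hat\beta_{OLS} - r$.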
The Lagrange Multiplier test is based upon the restricted estimator. In section 7.8, we derived the restricted estimator and the estimated Lagrange Multiplier $\hat\mu$. The Lagrange Multiplier $\mu$ is the cost or shadow price of imposing the restrictions $R\beta = r$. If these restrictions are true, one would expect the estimated Lagrange Multiplier $\hat\mu$ to have mean zero. Therefore, a test for the null hypothesis that $\mu = 0$ is called the LM test and the corresponding test statistic is given in equation (7.37). Alternatively, one can derive the LM test as a score test based on the score, or the first derivative of the log-likelihood function, i.e., $S(\beta) = \partial\log L/\partial\beta$. The score is zero for the unrestricted MLE, and the score test is based upon the departure of $S(\beta)$, evaluated at the restricted estimator $\hat\beta_{RMLE}$, from zero. In this case, the score form of the LM statistic is given by

$$LM = S(\hat\beta_{RMLE})'I(\hat\beta_{RMLE})^{-1}S(\hat\beta_{RMLE}) \qquad (7.42)$$

For our model, $S(\beta) = (X'y - X'X\beta)/\sigma^2$ and from equation (7.36) we have

$$S(\hat\beta_{RMLE}) = X'(y - X\hat\beta_{RMLE})/\sigma^2$$

$$= \{X'y - X'X\hat\beta_{OLS} + R'[R(X'X)^{-1}R']^{-1}(R\hat\beta_{OLS} - r)\}/\sigma^2$$

$$= R'[R(X'X)^{-1}R']^{-1}(R\hat\beta_{OLS} - r)/\sigma^2$$

Using (7.20), one gets $I^{-1}(\hat\beta_{RMLE}) = \sigma^2(X'X)^{-1}$. Therefore, the score form of the LM test becomes

$$LM = (R\hat\beta_{OLS} - r)'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\hat\beta_{OLS} - r)/\sigma^2$$

$$= (R\hat\beta_{OLS} - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta_{OLS} - r)/\sigma^2 \qquad (7.43)$$

This is numerically identical to the LM test derived in equation (7.37) and to the W and LR statistics derived above. Note that $S(\hat\beta_{RMLE}) = R'\hat\mu/\sigma^2$ from (7.35), so it is clear why the Score and the Lagrangian Multiplier tests are identical.

The score form of the LM test can also be obtained as a by-product of an artificial regression. In fact, $S(\beta)$ evaluated at $\hat\beta_{RMLE}$ is given by

$$S(\hat\beta_{RMLE}) = X'(y - X\hat\beta_{RMLE})/\sigma^2$$

where $y - X\hat\beta_{RMLE}$ is the vector of restricted residuals. If $H_0$ is true, then this converges asymptotically to $u$ and the asymptotic variance of the vector of scores becomes $(X'X)/\sigma^2$. The score test is then based upon

$$(y - X\hat\beta_{RMLE})'X(X'X)^{-1}X'(y - X\hat\beta_{RMLE})/\sigma^2 \qquad (7.44)$$

This expression is the explained sum of squares from the artificial regression of $(y - X\hat\beta_{RMLE})/\sigma$ on $X$. To see that this is exactly identical to the LM test in equation (7.37), recall from equation (7.33) that $R'\hat\mu = X'(y - X\hat\beta_{RMLE})$ and substituting this expression for $R'\hat\mu$ on the left hand side of equation (7.37) we get equation (7.44). In practice, $\sigma^2$ is estimated by $s^2$, the Mean Square Error of the restricted regression. This is an example of the Gauss-Newton Regression, which will be discussed in Chapter 8.
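
The artificial-regression route to the LM statistic can be sketched as follows. The example keeps $\sigma^2$ fixed at an assumed known value so that the comparison with (7.37) is exact; the data, the restriction, and the variable names are illustrative assumptions. The explained sum of squares used here is the uncentered one, i.e., the sum of squared fitted values.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 50, 3
sigma2 = 1.0  # assumed known for this illustration
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=n)

R = np.array([[0.0, 1.0, 1.0]])   # H0: beta_2 + beta_3 = 1
r = np.array([1.0])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
A = R @ XtX_inv @ R.T
d = R @ b - r
mu_hat = np.linalg.solve(A, d)            # Lagrange multiplier estimate (7.35)
b_r = b - XtX_inv @ R.T @ mu_hat          # restricted estimator (7.36)
e_r = y - X @ b_r                         # restricted residuals

# Artificial regression: restricted residuals on X
coef = np.linalg.lstsq(X, e_r, rcond=None)[0]
fitted = X @ coef
ess = fitted @ fitted                     # uncentered explained sum of squares
LM_artificial = ess / sigma2              # equation (7.44)

LM_direct = float(d @ np.linalg.solve(A, d)) / sigma2   # equation (7.43)
score = X.T @ e_r / sigma2                # S(beta) at the restricted estimator
print(np.isclose(LM_artificial, LM_direct))
```

The score vector computed in the last step equals $R'\hat\mu/\sigma^2$, which is the link between the score form and the Lagrange multiplier form of the test.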

An alternative approach to testing $H_0$ is to estimate the restricted and unrestricted models and compute the following $F$-statistic

$$F_{obs} = \frac{(RRSS - URSS)/g}{URSS/(n - k)} \qquad (7.45)$$

This statistic is known in the econometric literature as the Chow (1960) test and was encountered in Chapter 4. Note that from equation (7.39), if we divide the numerator by $\sigma^2$ we get a $\chi^2_g$ statistic divided by its degrees of freedom. Also, using the fact that $(n - k)s^2/\sigma^2$ is $\chi^2_{n-k}$, the denominator divided by $\sigma^2$ is a $\chi^2_{n-k}$ statistic divided by its degrees of freedom. Problem 11 shows independence of the numerator and denominator and completes the proof that $F_{obs}$ is distributed $F(g, n - k)$ under $H_0$.


Chow’s  (1960)  Test  for  Regression  Stability 

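Chow (1960) considered the problem of testing the equality of two sets of regression coefficients

$$y_1 = X_1\beta_1 + u_1 \quad\text{and}\quad y_2 = X_2\beta_2 + u_2 \qquad (7.46)$$

where $X_1$ is $n_1 \times k$ and $X_2$ is $n_2 \times k$ with $n_1$ and $n_2 > k$. In this case, the unrestricted regression can be written as

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \qquad (7.47)$$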


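Under the null hypothesis $H_0$: $\beta_1 = \beta_2 = \beta$, the restricted model becomes

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\beta + \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \qquad (7.48)$$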


The URSS and the RRSS are obtained from these two regressions by stacking the $n_1 + n_2$ observations. It is easy to show that the URSS $= e_1'e_1 + e_2'e_2$ where $e_1$ is the OLS residuals from $y_1$ on $X_1$ and $e_2$ is the OLS residuals from $y_2$ on $X_2$. In other words, the URSS is the sum of two residual sums of squares from the separate regressions, see problem 13. The Chow $F$-statistic given in equation (7.45) has $k$ and $(n_1 + n_2 - 2k)$ degrees of freedom, respectively. Equivalently, one can obtain this Chow $F$-statistic from running

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\beta_1 + \begin{bmatrix} 0 \\ X_2 \end{bmatrix}(\beta_2 - \beta_1) + \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \qquad (7.49)$$

Note that the second set of explanatory variables whose coefficients are $(\beta_2 - \beta_1)$ are interaction variables obtained by multiplying each independent variable in equation (7.48) by a dummy variable, say $D_2$, that takes on the value 1 if the observation is from the second regression and 0 if it is from the first regression. A test for $H_0$: $\beta_1 = \beta_2$ becomes a joint test of significance for the coefficients of these interaction variables. Gujarati (1970) points out that this dummy variable approach has the additional advantage of giving the estimates of $(\beta_2 - \beta_1)$ and their $t$-statistics. If the Chow $F$-test rejects stability, these individual interaction dummy variable coefficients may point to the source of instability. Of course, one has to be careful with the interpretation of these individual $t$-statistics; after all, they can all be insignificant with the joint $F$-statistic still being significant, see Maddala (1992).
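
The two ways of obtaining the Chow statistic can be compared on simulated data. In this sketch the two sub-samples, their coefficient values, and the helper name `rss` are assumptions made for illustration; `D2` is the dummy variable described above.

```python
import numpy as np

rng = np.random.default_rng(8)
n1, n2, k = 30, 25, 2
X1 = np.column_stack([np.ones(n1), rng.normal(size=n1)])
X2 = np.column_stack([np.ones(n2), rng.normal(size=n2)])
y1 = X1 @ np.array([1.0, 2.0]) + rng.normal(size=n1)
y2 = X2 @ np.array([1.5, 1.0]) + rng.normal(size=n2)

def rss(X, y):
    """Return the residual sum of squares and the OLS coefficients."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e, b

y = np.concatenate([y1, y2])
X = np.vstack([X1, X2])

rrss, _ = rss(X, y)            # restricted model (7.48): common beta
rss1, b1 = rss(X1, y1)         # separate regression, first sample
rss2, b2 = rss(X2, y2)         # separate regression, second sample
urss = rss1 + rss2             # unrestricted RSS from (7.47)

# Chow F-statistic (7.45) with k and n1 + n2 - 2k degrees of freedom
F_chow = ((rrss - urss) / k) / (urss / (n1 + n2 - 2 * k))

# Dummy-interaction form (7.49): regressors [X, D2 * X]
D2 = np.concatenate([np.zeros(n1), np.ones(n2)])
X_int = np.column_stack([X, X * D2[:, None]])
urss_int, b_int = rss(X_int, y)
print(np.isclose(urss_int, urss), np.allclose(b_int[k:], b2 - b1))
```

The interaction coefficients in `b_int[k:]` are the estimated differences $\hat\beta_2 - \hat\beta_1$, which is the extra information Gujarati's formulation provides.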

In case one of the two regressions does not have sufficient observations to estimate a separate regression, say $n_2 < k$, then one can proceed by running the regression on the full data set to get the RRSS. This is the restricted model because the extra $n_2$ observations are assumed to be generated by the same regression as the first $n_1$ observations. The URSS is the residual sum of squares based only on the longer period ($n_1$ observations). In this case, the Chow $F$-statistic given in equation (7.45) has $n_2$ and $n_1 - k$ degrees of freedom, respectively. This is known as Chow's predictive test since it tests whether the shorter $n_2$ observations are different from their predictions using the model with the longer $n_1$ observations. This predictive test can be performed with dummy variables as follows: Introduce $n_2$ observation-specific dummies, one for each of the observations in the second regression. Test the joint significance of these $n_2$ dummy variables. Salkever's (1976) result applies and each dummy variable will have as its estimated coefficient the prediction error with its corresponding standard error and its $t$-statistic. Once again, the individual dummies may point out possible outliers, but it is their joint significance that is under question.
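
A small sketch of Chow's predictive test follows, for a case with $n_2 < k$. The simulated data and the helper name `fit` are assumptions for illustration; the point is that the regression with $n_2$ observation-specific dummies reproduces the URSS from the first $n_1$ observations.

```python
import numpy as np

rng = np.random.default_rng(9)
n1, n2, k = 30, 2, 3   # n2 < k: too few observations for a separate regression
X = np.column_stack([np.ones(n1 + n2), rng.normal(size=(n1 + n2, k - 1))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n1 + n2)

def fit(X, y):
    """Return the OLS coefficients and the residual sum of squares."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return b, e @ e

_, rrss = fit(X, y)                 # full sample: restricted model
b1, urss = fit(X[:n1], y[:n1])      # first n1 observations only

# Chow predictive F-statistic with n2 and n1 - k degrees of freedom
F_pred = ((rrss - urss) / n2) / (urss / (n1 - k))

# Equivalent dummy-variable regression: one dummy per extra observation
D = np.vstack([np.zeros((n1, n2)), np.eye(n2)])
b_aug, rss_aug = fit(np.column_stack([X, D]), y)
pred_err = y[n1:] - X[n1:] @ b1     # prediction errors for the n2 observations
print(np.isclose(rss_aug, urss), np.allclose(b_aug[k:], pred_err))
```

A joint $F$-test on the `n2` dummy coefficients in the augmented regression is therefore the same test as `F_pred`.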

The  W,  LR  and  LM  Inequality 

We have shown that LR = W = LM for linear restrictions if the log-likelihood is quadratic. However, this is not necessarily the case for more general situations. In fact, in the next chapter, where we consider a more general variance-covariance structure on the disturbances, estimating this variance-covariance matrix destroys this equality and may lead to conflict in hypotheses testing, as noted by Berndt and Savin (1977). In this case, $W \geq LR \geq LM$. See also the problems at the end of this chapter. The LR, W and LM tests are based on the efficient MLE. When consistent rather than efficient estimators are used, an alternative way of constructing the score-type test is known as Neyman's $C(\alpha)$. For details, see Bera and Permaratne (2001).

Although these three tests are asymptotically equivalent, one test may be more convenient
than another for a particular problem. For example, when the model is linear but the restriction
is nonlinear, the unrestricted model is easier to estimate than the restricted model. So the Wald
test suggests itself in that it relies only on the unrestricted estimator. Unfortunately, the Wald
test has a drawback that the LR and LM tests do not have. In finite samples, the Wald test is not
invariant to testing two algebraically equivalent formulations of the nonlinear restriction. This
fact has been pointed out in the econometric literature by Gregory and Veall (1985, 1986) and
Lafontaine and White (1986). In what follows, we review some of Gregory and Veall's (1985)
findings:

Consider the linear regression with two regressors

$$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + u_t$$  (7.50)

where the $u_t$'s are IIN$(0, \sigma^2)$, and the nonlinear restriction $\beta_1 \beta_2 = 1$. Two algebraically
equivalent formulations of the null hypothesis are: $H^A$; $r^A(\beta) = \beta_1 - 1/\beta_2 = 0$, and
$H^B$; $r^B(\beta) = \beta_1 \beta_2 - 1 = 0$. The unrestricted maximum likelihood estimator is $\hat\beta_{OLS}$
and the Wald statistic given in (7.40) is

$$W = r(\hat\beta_{OLS})' [R(\hat\beta_{OLS}) \hat{V}(\hat\beta_{OLS}) R'(\hat\beta_{OLS})]^{-1} r(\hat\beta_{OLS})$$  (7.51)

where $\hat{V}(\hat\beta_{OLS})$ is the usual estimated variance-covariance matrix of $\hat\beta_{OLS}$. Problem 19 asks the
reader to verify that the Wald statistics corresponding to $H^A$ and $H^B$ using (7.51) are

$$W^A = (\hat\beta_1 \hat\beta_2 - 1)^2 / (\hat\beta_2^2 v_{11} + 2 v_{12} + v_{22}/\hat\beta_2^2)$$  (7.52)

and

$$W^B = (\hat\beta_1 \hat\beta_2 - 1)^2 / (\hat\beta_2^2 v_{11} + 2 \hat\beta_1 \hat\beta_2 v_{12} + \hat\beta_1^2 v_{22})$$  (7.53)

where the $v_{ij}$'s are the elements of $\hat{V}(\hat\beta_{OLS})$ for $i, j = 0, 1, 2$. These Wald statistics are clearly not
identical, and other algebraically equivalent formulations of the null hypothesis can be generated
with correspondingly different Wald statistics. Monte Carlo experiments were performed with
1000 replications on the model given in (7.50) with various values for $\beta_1$ and $\beta_2$, and for a
sample size $n$ = 20, 30, 50, 100, 500. The experiments were run when the null hypothesis is true
and when it is false. For $n = 20$ and $\beta_1 = 10$, $\beta_2 = 0.1$, so that $H_0$ is satisfied, $W^A$ rejects the
null when it is true 293 times out of 1000, while $W^B$ rejects the null 65 times out of 1000. At
the 5% level one would expect 50 rejections with a 95% confidence interval [36, 64]. Both $W^A$
and $W^B$ reject too often but $W^A$ performs worse than $W^B$. When $n$ is increased to 500, $W^A$
rejects 78 times while $W^B$ rejects 39 times out of 1000. $W^A$ still rejects too often although
its performance is better than that for $n = 20$, while $W^B$ performs well and is within the 95%
confidence region. When $n = 20$, $\beta_1 = 1$ and $\beta_2 = 0.5$, so that $H_0$ is not satisfied, $W^A$ rejects
the null when it is false 65 times out of 1000 whereas $W^B$ rejects it 584 times out of 1000. For
$n = 500$, both test statistics reject the null in 1000 out of 1000 times. Even in cases where the
empirical sizes of the tests appear similar, see Table 1 of Gregory and Veall (1985), in particular
the case where $\beta_1 = \beta_2 = 1$, Gregory and Veall find that $W^A$ and $W^B$ are in conflict about 5%
of the time for $n = 20$, and this conflict drops to 0.5% at $n = 500$. Problem 20 asks the reader
to derive four Wald statistics corresponding to four algebraically equivalent formulations of the
common factor restriction analyzed by Hendry and Mizon (1978). Gregory and Veall (1986)
give Monte Carlo results on the performance of these Wald statistics for various sample sizes.
Once again they find conflict among these tests even when their empirical sizes appear to be
similar. Also, the differences among the Wald statistics are much more substantial, and persist
even when $n$ is as large as 500.
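
As a rough illustration of the non-invariance, the two Wald statistics for the restriction that the product of the two slope coefficients equals one can be computed on a single simulated sample. This is only a sketch under assumed settings (one draw with n = 20, slopes 10 and 0.1, unit error variance); it does not reproduce the Gregory and Veall (1985) Monte Carlo design, and the helper names are hypothetical.

```python
import numpy as np

def ols_with_cov(y, X):
    """OLS coefficients and the usual estimate s^2 (X'X)^{-1}."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    e = y - X @ b
    s2 = float(e @ e) / (n - k)
    return b, s2 * xtx_inv

def wald_scalar(r, R, V):
    """Wald statistic for a single restriction: r^2 / (R V R')."""
    return float(r ** 2 / (R @ V @ R))

def wald_a_and_b(y, x1, x2):
    X = np.column_stack([np.ones(len(y)), x1, x2])
    b, V = ols_with_cov(y, X)
    b1, b2 = b[1], b[2]
    # Formulation A: r = b1 - 1/b2, gradient (0, 1, 1/b2^2)
    w_a = wald_scalar(b1 - 1.0 / b2, np.array([0.0, 1.0, 1.0 / b2 ** 2]), V)
    # Formulation B: r = b1*b2 - 1, gradient (0, b2, b1)
    w_b = wald_scalar(b1 * b2 - 1.0, np.array([0.0, b2, b1]), V)
    return w_a, w_b, b, V

rng = np.random.default_rng(1)
n = 20
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 10.0 * x1 + 0.1 * x2 + rng.normal(size=n)
w_a, w_b, b, V = wald_a_and_b(y, x1, x2)
print(w_a, w_b)
```

Both statistics test the same null hypothesis on the same estimates, yet they are built from different gradients and so generally take different values.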

Lafontaine and White (1986) consider a simple regression

$$y = \alpha + \beta x + \gamma z + u$$

where $y$ is log of per capita consumption of textiles, $x$ is log of per capita real income and $z$ is
log of relative prices of textiles, with the data taken from Theil (1971, p. 102). The estimated
equation is:

$$\hat{y} = 1.37 + 1.14 x - 0.83 z$$
        (0.31)  (0.16)   (0.04)

with $\hat\sigma^2 = 0.0001833$, and $n = 17$, with standard errors shown in parentheses. Consider the
null hypothesis $H_0$; $\beta = 1$. Algebraically equivalent formulations of $H_0$ are $H_k$; $\beta^k = 1$ for any
exponent $k$. Applying (7.40) with $r(\beta) = \beta^k - 1$ and $R(\beta) = k\beta^{k-1}$, one gets the Wald statistic

$$W_k = (\hat\beta^k - 1)^2 / [(k \hat\beta^{k-1})^2 V(\hat\beta)]$$  (7.54)

where $\hat\beta$ is the OLS estimate of $\beta$ and $V(\hat\beta)$ is its corresponding estimated variance. For every $k$,
$W_k$ has a limiting $\chi^2_1$ distribution under $H_0$. The critical values are $\chi^2_{1,.05} = 3.84$ and $F^{.05}_{1,14} = 4.6$.
The latter is an exact distribution test for $\beta = 1$ under $H_0$. Lafontaine and White (1986) try
different integer exponents $(\pm k)$ where $k$ = 1, 2, 3, 6, 10, 20, 40. Using $\hat\beta = 1.14$ and $V(\hat\beta) =
(0.16)^2$ one gets $W_{-20} = 24.56$, $W_1 = 0.84$, and $W_{20} = 0.12$. The authors conclude that one
could get any Wald statistic desired by choosing an appropriate exponent. Since $\hat\beta > 1$, $W_k$ is
inversely related to $k$. So, we can find a $W_k$ that exceeds the critical values given by the $\chi^2$ and
F distributions. In fact, $W_{-20}$ leads to rejection whereas $W_1$ and $W_{20}$ do not reject $H_0$.

For testing nonlinear restrictions, the Wald test is easy to compute. However, it has a serious
problem in that it is not invariant to the way the null hypothesis is formulated. In this case,
the score test may be difficult to compute, but Neyman's $C(\alpha)$ test is convenient to use and
provides the invariance that is needed, see Dagenais and Dufour (1991).
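
The dependence of the Wald statistic on the exponent is easy to trace with a few lines of code. The sketch below evaluates the statistic in (7.54) using the rounded estimate 1.14 and standard error 0.16 quoted above; because these inputs are rounded, the output should not be expected to match the reported values to the second decimal. The function name `wald_power` is an assumption of this example.

```python
def wald_power(beta_hat, var_beta, k):
    """Wald statistic for the formulation beta^k = 1, as in (7.54)."""
    r = beta_hat ** k - 1.0            # restriction evaluated at the estimate
    R = k * beta_hat ** (k - 1)        # derivative of the restriction
    return r ** 2 / (R ** 2 * var_beta)

beta_hat, var_beta = 1.14, 0.16 ** 2   # rounded values quoted in the text
exponents = [-40, -20, -10, -6, -3, -2, -1, 1, 2, 3, 6, 10, 20, 40]
stats = {k: wald_power(beta_hat, var_beta, k) for k in exponents}
for k in exponents:
    decision = "reject" if stats[k] > 3.84 else "do not reject"
    print(k, round(stats[k], 2), decision)
```

With an estimate above one, the statistic falls steadily as the exponent rises, so large negative exponents reject the null at the 5% chi-squared critical value while positive exponents do not.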


Notes 

1. For example, in a time-series setting, including the time trend in the multiple regression is equivalent
to detrending each variable first, by residualing out the effect of time, and then running the
regression on these residuals.

2. Two exceptions noted in Davidson and MacKinnon (1993) are the following: One, if the model
is not identified asymptotically. For example, $y_t = \beta(1/t) + u_t$ for $t = 1, 2, \ldots, T$, will have $(1/t)$
tend to zero as $T \to \infty$. This means that as the sample size increases, there is no information on $\beta$.
Two, if the number of parameters in the model increases as the sample size increases. For example,
the fixed effects model in panel data discussed in Chapter 12.

3. If the MLE of $\beta$ is $\hat\beta_{MLE}$, then the MLE of $(1/\beta)$ is $(1/\hat\beta_{MLE})$. Note that this invariance property
implies that MLE cannot be in general unbiased. For example, even if $\hat\beta_{MLE}$ is unbiased for $\beta$, by
the above reparameterization, $(1/\hat\beta_{MLE})$ is not unbiased for $(1/\beta)$.

4. If the distribution of disturbances is not normal, then OLS is still BLUE as long as the assumptions
underlying the Gauss-Markov Theorem are satisfied. The MLE in this case will be in general more
efficient than OLS as long as the distribution of the errors is correctly specified.

5. Using the Taylor Series approximation of $r(\hat\beta_{MLE})$ around the true parameter vector $\beta$, one gets
$r(\hat\beta_{MLE}) \simeq r(\beta) + R(\beta)(\hat\beta_{MLE} - \beta)$. Under the null hypothesis, $r(\beta) = 0$ and the
$\mathrm{var}[r(\hat\beta_{MLE})] \simeq R(\beta)\, \mathrm{var}(\hat\beta_{MLE})\, R'(\beta)$.


Problems 

1.  Invariance  of  the  Fitted  Values  and  Residuals  to  Nonsingular  Transformations  of  the  Independent 
Variables.  Post-multiply  the  independent  variables  in  (7.1)  by  a  nonsingular  transformation  C,  so 
that  X*  =  XC. 

(a) Show that $P_{X^*} = P_X$ and $\bar{P}_{X^*} = \bar{P}_X$. Conclude that the regression of $y$ on $X$ has the same
fitted values and the same residuals as the regression of $y$ on $X^*$.

(b) As an application of these results, suppose that every $X$ was multiplied by a constant, say,
a change in the units of measurement. Would the fitted values or residuals change when we
rerun this regression?

(c) Suppose that $X$ contains two regressors $X_1$ and $X_2$ each of dimension $n \times 1$. If we run the
regression of $y$ on $(X_1 - X_2)$ and $(X_1 + X_2)$, will this yield the same fitted values and the
same residuals as the original regression of $y$ on $X_1$ and $X_2$?
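
A numerical check is not a proof, but it can help build intuition before attempting Problem 1. The sketch below uses simulated data and an assumed nonsingular matrix `C` that replaces the last two regressors by their difference and their sum; all names are chosen for this illustration only.

```python
import numpy as np

def projection(X):
    """Projection matrix X (X'X)^{-1} X'."""
    return X @ np.linalg.solve(X.T @ X, X.T)

rng = np.random.default_rng(2)
n, k = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

# Columns of X @ C: the constant, (x1 - x2) and (x1 + x2).
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, -1.0, 1.0]])
X_star = X @ C

fitted = projection(X) @ y
fitted_star = projection(X_star) @ y
print(np.max(np.abs(fitted - fitted_star)))   # should be at rounding-error level
```

Since the two sets of regressors span the same column space, the fitted values, and hence the residuals, coincide up to floating-point error.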

2.  The  FWL  Theorem. 

(a) Using partitioned inverse results from the Appendix, show that the solution to (7.9) yields
$\hat\beta_{2,OLS}$ given in (7.10).

(b) Alternatively, write (7.9) as a system of two equations in two unknowns $\hat\beta_{1,OLS}$ and $\hat\beta_{2,OLS}$.
Solve, by eliminating $\hat\beta_{1,OLS}$ and show that the resulting solution is given by (7.10).

(c) Using the FWL Theorem, show that if $X_1 = \iota_n$ a vector of ones indicating the presence of
the constant in the regression, and $X_2$ is a set of economic variables, then (i) $\hat\beta_{2,OLS}$ can
be obtained by running $y_i - \bar{y}$ on the set of variables in $X_2$ expressed as deviations from
their respective sample means. (ii) The least squares estimate of the constant $\hat\beta_{1,OLS}$ can
be retrieved as $\bar{y} - \bar{X}_2' \hat\beta_{2,OLS}$ where $\bar{X}_2' = \iota_n' X_2 / n$ is the vector of sample means of the
independent variables in $X_2$.

3. Let $y = X\beta + D_i \gamma + u$ where $y$ is $n \times 1$, $X$ is $n \times k$ and $D_i$ is a dummy variable that takes the value
1 for the i-th observation and 0 otherwise. Using the FWL Theorem, prove that the least squares
estimates of $\beta$ and $\gamma$ from this regression are $\hat\beta_{OLS} = (X^{*\prime} X^*)^{-1} X^{*\prime} y^*$ and $\hat\gamma_{OLS} = y_i - x_i' \hat\beta_{OLS}$,
where $X^*$ denotes the $X$ matrix without the i-th observation and $y^*$ is the $y$ vector without
the i-th observation and $(y_i, x_i')$ denotes the i-th observation on the dependent and independent
variables. This means that $\hat\gamma_{OLS}$ is the forecasted OLS residual from the regression of $y^*$ on $X^*$
for the i-th observation which was essentially excluded from the regression by the inclusion of the
dummy variable $D_i$.

4.  Maximum  Likelihood  Estimation.  Given  the  log-likelihood  in  (7.17), 

(a) Derive the first-order conditions for maximization and show that $\hat\beta_{MLE} = \hat\beta_{OLS}$ and that
$\hat\sigma^2_{MLE} = RSS/n$.

(b) Calculate the second derivatives given in (7.18) and verify that the information matrix
reduces to (7.19).

5. Given that $u \sim N(0, \sigma^2 I_n)$, we showed that $(n - k)s^2/\sigma^2 \sim \chi^2_{n-k}$. Use this fact to prove that,

(a) $s^2$ is unbiased for $\sigma^2$.

(b) $\mathrm{var}(s^2) = 2\sigma^4/(n - k)$. Hint: $E(\chi^2_r) = r$ and $\mathrm{var}(\chi^2_r) = 2r$.

6. Consider all estimators of $\sigma^2$ of the type $\tilde\sigma^2 = e'e/r = u'\bar{P}_X u/r$ with $u \sim N(0, \sigma^2 I_n)$.

(a) Find $E(\hat\sigma^2_{MLE})$ and the bias$(\hat\sigma^2_{MLE})$.

(b) Find $\mathrm{var}(\hat\sigma^2_{MLE})$ and the MSE$(\hat\sigma^2_{MLE})$.

(c) Compute MSE$(\tilde\sigma^2)$ and minimize it with respect to $r$. Compare with the MSE of $s^2$ and
$\hat\sigma^2_{MLE}$.

7.  Computing  Forecasts  and  Forecast  Standard  Errors  Using  a  Regression  Package.  This  is  based  on 
Salkever  (1976).  From  equations  (7.23)  and  (7.24),  show  that 

(a) $\hat\delta'_{OLS} = (\hat\beta'_{OLS}, \hat\gamma'_{OLS})$ where $\hat\beta_{OLS} = (X'X)^{-1}X'y$, and $\hat\gamma_{OLS} = y_o - X_o \hat\beta_{OLS}$. Hint: Set up
the OLS normal equations and solve two equations in two unknowns. Alternatively, one can
use the FWL Theorem to residual out the additional $T_o$ dummy variables.

(b) $e^*_{OLS} = (e'_{OLS}, 0')'$ and $s^{*2} = s^2$.

(c) $s^{*2}(X^{*\prime}X^*)^{-1}$ is given by the expression in (7.25). Hint: Use partitioned inverse.

(d) Replace $y_o$ by 0 and $I_{T_o}$ by $-I_{T_o}$ in (7.23) and show that $\hat\gamma = \hat{y}_o = X_o \hat\beta_{OLS}$ whereas all the
results in parts (a), (b) and (c) remain the same.

8. (a) Show that $\mathrm{cov}(\hat\beta_{OLS}, e) = 0$. (Since both random variables are normally distributed, this
proves their independence).

(b) Show that $\hat\beta_{OLS}$ and $s^2$ are independent. Hint: A linear ($Bu$) and quadratic ($u'Au$) forms
in normal random variables are independent if $BA = 0$. See Graybill (1961) Theorem 4.17.

9. (a) Show that if one replaces $R$ by $c'$ in (7.29) one gets the square of the z-statistic given in
(7.26).

(b) Show that when we replace $\sigma^2$ by $s^2$, the $\chi^2_1$ statistic given in part (a) becomes the square
of a t-statistic which is distributed as $F(1, n - k)$. Hint: The square of a $N(0, 1)$ is $\chi^2_1$. Also
the ratio of two independent $\chi^2$ random variables divided by their degrees of freedom is an
F-statistic with these corresponding degrees of freedom, see Chapter 2.

10. (a) Show that the matrix $A$ defined in (7.30) by $u'Au/\sigma^2$ is symmetric, idempotent and of rank
$g$.

(b) Using the same proof given below lemma 1, show that (7.30) is $\chi^2_g$.

11. (a) Show that the two quadratic forms $s^2 = u'\bar{P}_X u/(n - k)$ and that given in (7.30) are
independent. Hint: Two positive semi-definite quadratic forms $u'Au$ and $u'Bu$ are independent
if and only if $AB = 0$, see Graybill (1961) Theorem 4.10.

(b) Conclude that (7.31) is distributed as an $F(g, n - k)$.

12.  Restricted  Least  Squares. 

(a) Show that $\hat\beta_{RLS}$ given by (7.36) is biased unless $R\beta = r$.

(b) Show that the $\mathrm{var}(\hat\beta_{RLS}) = \mathrm{var}(A(X'X)^{-1}X'u)$ where

$$A = I_k - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R.$$

Prove that $A^2 = A$, but $A' \neq A$. Conclude that

$$\mathrm{var}(\hat\beta_{RLS}) = \sigma^2 A(X'X)^{-1}A' = \sigma^2\{(X'X)^{-1} - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}\}.$$

(c) Show that $\mathrm{var}(\hat\beta_{OLS}) - \mathrm{var}(\hat\beta_{RLS})$ is a positive semi-definite matrix.

13. The Chow Test.

(a) Show that OLS on (7.47) yields OLS on each equation separately in (7.46). In other words,
$\hat\beta_{1,OLS} = (X_1'X_1)^{-1}X_1'y_1$ and $\hat\beta_{2,OLS} = (X_2'X_2)^{-1}X_2'y_2$.

(b) Show that the residual sum of squares for equation (7.47) is given by $RSS_1 + RSS_2$, where
$RSS_i$ is the residual sum of squares from running $y_i$ on $X_i$ for $i = 1, 2$.

(c) Show that the Chow F-statistic can be obtained from (7.49) by testing for the joint
significance of $H_0$; $\beta_2 - \beta_1 = 0$.

14. Suppose we would like to test $H_0$; $\beta_2 = 0$ in the following unrestricted model given also in (7.8)

$$y = X\beta + u = X_1\beta_1 + X_2\beta_2 + u$$

(a) Using the FWL Theorem, show that the URSS is identical to the residual sum of squares
obtained from $\bar{P}_{X_1}y = \bar{P}_{X_1}X_2\beta_2 + \bar{P}_{X_1}u$. Conclude that

$$URSS = y'\bar{P}_X y = y'\bar{P}_{X_1}y - y'\bar{P}_{X_1}X_2(X_2'\bar{P}_{X_1}X_2)^{-1}X_2'\bar{P}_{X_1}y.$$

(b) Show that the numerator of the F-statistic for testing $H_0$; $\beta_2 = 0$ which is given in (7.45),
is $y'\bar{P}_{X_1}X_2(X_2'\bar{P}_{X_1}X_2)^{-1}X_2'\bar{P}_{X_1}y/k_2$.

Substituting $y = X_1\beta_1 + u$ under the null hypothesis, show that the above expression reduces
to $u'\bar{P}_{X_1}X_2(X_2'\bar{P}_{X_1}X_2)^{-1}X_2'\bar{P}_{X_1}u/k_2$.

(c) Let $v = X_2'\bar{P}_{X_1}u$, show that if $u \sim$ IIN$(0, \sigma^2)$ then $v \sim N(0, \sigma^2 X_2'\bar{P}_{X_1}X_2)$. Conclude that
the numerator of the F-statistic given in part (b) when divided by $\sigma^2$ can be written as
$v'[\mathrm{var}(v)]^{-1}v/k_2$ where $v'[\mathrm{var}(v)]^{-1}v$ is distributed as $\chi^2_{k_2}$ under $H_0$. Hint: See the discussion
below lemma 1.

(d) Using the result that $(n - k)s^2/\sigma^2 \sim \chi^2_{n-k}$ where $s^2$ is the $URSS/(n - k)$, show that the
F-statistic given by (7.45) is distributed as $F(k_2, n - k)$ under $H_0$. Hint: You need to show
that $u'\bar{P}_X u$ is independent of the quadratic term given in part (b), see problem 11.

(e) Show that the Wald Test for $H_0$; $\beta_2 = 0$, given in (7.41), reduces in this case to $W =
\hat\beta_2'[R(X'X)^{-1}R']^{-1}\hat\beta_2/s^2$ where $R = [0, I_{k_2}]$, $\hat\beta_2$ denotes the OLS or equivalently the MLE
of $\beta_2$ from the unrestricted model and $s^2$ is the corresponding estimate of $\sigma^2$ given by
$URSS/(n - k)$. Using partitioned inversion or the FWL Theorem, show that the numerator
of $W$ is $k_2$ times the expression in part (b).

(f) Show that the score form of the LM statistic, given in (7.42) and (7.44), can be obtained
as the explained sum of squares from the artificial regression of the restricted residuals
$(y - X_1\hat\beta_{1,RLS})$ deflated by $\tilde{s}$ on the matrix of regressors $X$. In this case, $\tilde{s}^2 = RRSS/(n - k_1)$
is the Mean Square Error of the restricted regression. In other words, obtain the explained
sum of squares from regressing $\bar{P}_{X_1}y/\tilde{s}$ on $X_1$ and $X_2$.

15. Iterative Estimation in Partitioned Regression Models. This is based on Fiebig (1995). Consider the
partitioned regression model given in (7.8) and let $X_2$ be a single regressor, call it $x_2$ of dimension
$n \times 1$ so that $\beta_2$ is a scalar. Consider the following strategy for estimating $\beta_2$: Estimate $\beta_1$ from
the shortened regression of $y$ on $X_1$. Regress the residuals from this regression on $x_2$ to yield $b_2^{(1)}$.

(a) Prove that $b_2^{(1)}$ is biased.

Now consider the following iterative strategy for re-estimating $\beta_2$:

Re-estimate $\beta_1$ by regressing $y - x_2 b_2^{(1)}$ on $X_1$ to yield $b_1^{(1)}$. Next iterate according to the
following scheme:

$$b_1^{(j)} = (X_1'X_1)^{-1}X_1'(y - x_2 b_2^{(j)})$$

$$b_2^{(j+1)} = (x_2'x_2)^{-1}x_2'(y - X_1 b_1^{(j)}), \quad j = 1, 2, \ldots$$

(b) Determine the behavior of the bias of $b_2^{(j+1)}$ as $j$ increases.

(c) Show that as $j$ increases $b_2^{(j+1)}$ converges to the estimator of $\beta_2$ obtained by running OLS
on (7.8).
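
The iterative scheme of Problem 15 is straightforward to simulate, which may help in conjecturing the answers to parts (b) and (c) before proving them. The sketch below is only an illustration on simulated data; the function name `iterate_partitioned`, the number of iterations and the data-generating values are assumptions of this example.

```python
import numpy as np

def iterate_partitioned(y, X1, x2, n_iter=200):
    """Alternate between the two regressions and return the path of b2."""
    X1_left_inv = np.linalg.solve(X1.T @ X1, X1.T)   # (X1'X1)^{-1} X1'
    resid = y - X1 @ (X1_left_inv @ y)               # residuals of y on X1
    b2 = float(x2 @ resid) / float(x2 @ x2)          # first-step estimate of beta_2
    path = [b2]
    for _ in range(n_iter):
        b1 = X1_left_inv @ (y - x2 * b2)                  # update of b1
        b2 = float(x2 @ (y - X1 @ b1)) / float(x2 @ x2)   # update of b2
        path.append(b2)
    return np.array(path)

rng = np.random.default_rng(7)
n = 200
z = rng.normal(size=n)
X1 = np.column_stack([np.ones(n), z])
x2 = 0.5 * z + rng.normal(size=n)          # x2 is correlated with X1
y = X1 @ np.array([1.0, 2.0]) + 1.5 * x2 + rng.normal(size=n)

path = iterate_partitioned(y, X1, x2)
b_full, *_ = np.linalg.lstsq(np.column_stack([X1, x2]), y, rcond=None)
print(path[0], path[-1], b_full[2])
```

In this design the first-step estimate differs from the full OLS coefficient on the single regressor, while the last iterate agrees with it to many decimal places.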

16. Maddala (1992, pp. 120-127). Consider the simple linear regression

$$y_i = \alpha + \beta X_i + u_i \quad i = 1, 2, \ldots, n.$$
where $\alpha$ and $\beta$ are scalars and $u_i \sim$ IIN$(0, \sigma^2)$. For $H_0$; $\beta = 0$,

(a) Derive the Likelihood Ratio (LR) statistic and show that it can be written as $n \log[1/(1 - r^2)]$
where $r^2$ is the square of the correlation coefficient between $X$ and $y$.

(b) Derive the Wald (W) statistic for testing $H_0$; $\beta = 0$. Show that it can be written as
$nr^2/(1 - r^2)$. This is the square of the usual t-statistic on $\beta$ with $\hat\sigma^2_{MLE} = \sum_{i=1}^n e_i^2/n$ used instead of
$s^2$ in estimating $\sigma^2$. $\hat\beta$ is the unrestricted MLE which is OLS in this case, and the $e_i$'s are
the usual least squares residuals.

(c) Derive the Lagrange Multiplier (LM) statistic for testing $H_0$; $\beta = 0$. Show that it can be
written as $nr^2$. This is the square of the usual t-statistic on $\beta$ with $\tilde\sigma^2_{RMLE} = \sum_{i=1}^n (y_i - \bar{y})^2/n$
used instead of $s^2$ in estimating $\sigma^2$. The $\tilde\sigma^2_{RMLE}$ is restricted MLE of $\sigma^2$ (i.e., imposing $H_0$
and maximizing the likelihood with respect to $\sigma^2$).

(d) Show that $LM/n = (W/n)/[1 + (W/n)]$, and $LR/n = \log[1 + (W/n)]$. Using the following
inequality $x \geq \log(1 + x) \geq x/(1 + x)$, conclude that $W \geq LR \geq LM$. Hint: Use $x = W/n$.

(e) For the cigarette consumption data given in Table 3.2, compute the W, LR, LM for the
simple regression of logC on logP and demonstrate the above inequality given in part (d)
for testing that the price elasticity is zero.
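
The ordering of the three statistics in Problem 16 can be seen numerically from the expressions stated in parts (a) to (c). The sketch below uses simulated data rather than the cigarette data of Table 3.2, which are not reproduced here; the function name and the simulated design are assumptions of this example.

```python
import numpy as np

def w_lr_lm_from_r2(r2, n):
    """W, LR and LM for a zero slope in a simple regression, from r^2 and n."""
    w = n * r2 / (1.0 - r2)
    lr = n * np.log(1.0 / (1.0 - r2))
    lm = n * r2
    return w, lr, lm

rng = np.random.default_rng(3)
n = 40
x = rng.normal(size=n)
y = 1.0 + 0.3 * x + rng.normal(size=n)
r2 = np.corrcoef(x, y)[0, 1] ** 2          # squared correlation between x and y
w, lr, lm = w_lr_lm_from_r2(r2, n)
print(w, lr, lm)
```

The three numbers are printed in the order W, LR, LM and should be non-increasing, as the inequality in part (d) implies.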

17. Engle (1984, pp. 785-786). Consider a set of $T$ independent observations on a Bernoulli random
variable which takes on the values $y_t = 1$ with probability $\theta$, and $y_t = 0$ with probability $(1 - \theta)$.

(a) Derive the log-likelihood function, the MLE of $\theta$, the score $S(\theta)$, and the information $I(\theta)$.

(b) Compute the LR, W and LM test statistics for testing $H_0$; $\theta = \theta_0$, versus $H_A$; $\theta \neq \theta_0$ for
$\theta \in (0, 1)$.

18. Engle (1984, pp. 787-788). Consider the linear regression model

$$y = X\beta + u = X_1\beta_1 + X_2\beta_2 + u$$
given in (7.8), where $u \sim N(0, \sigma^2 I_T)$.

(a) Write down the log-likelihood function, find the MLE of $\beta$ and $\sigma^2$.

(b) Write down the score $S(\beta)$ and show that the information matrix is block-diagonal between
$\beta$ and $\sigma^2$.

(c) Derive the W, LR and LM test statistics in order to test $H_0$; $\beta_1 = \beta_1^0$, versus $H_A$; $\beta_1 \neq \beta_1^0$,
where $\beta_1$ is say the first $k_1$ elements of $\beta$. Show that if $X = [X_1, X_2]$, then

$$W = (\beta_1^0 - \hat\beta_1)'[X_1'\bar{P}_{X_2}X_1](\beta_1^0 - \hat\beta_1)/\hat\sigma^2$$
$$LM = \tilde{u}'X_1[X_1'\bar{P}_{X_2}X_1]^{-1}X_1'\tilde{u}/\tilde\sigma^2$$
$$LR = T \log(\tilde{u}'\tilde{u}/\hat{u}'\hat{u})$$

where $\hat{u} = y - X\hat\beta$, $\tilde{u} = y - X\tilde\beta$ and $\hat\sigma^2 = \hat{u}'\hat{u}/T$, $\tilde\sigma^2 = \tilde{u}'\tilde{u}/T$. $\hat\beta$ is the unrestricted MLE,
whereas $\tilde\beta$ is the restricted MLE.

(d) Using the above results, show that

$$W = T(\tilde{u}'\tilde{u} - \hat{u}'\hat{u})/\hat{u}'\hat{u}$$
$$LM = T(\tilde{u}'\tilde{u} - \hat{u}'\hat{u})/\tilde{u}'\tilde{u}$$

Also, that $LR = T \log[1 + (W/T)]$; $LM = W/[1 + (W/T)]$; and $(T - k)W/Tk_1 \sim F_{k_1, T-k}$
under $H_0$. As in problem 16, we use the inequality $x \geq \log(1 + x) \geq x/(1 + x)$ to conclude
that $W \geq LR \geq LM$. Hint: Use $x = W/T$. However, it is important to note that all the test
statistics are monotonic functions of the F-statistic and exact tests for each would produce
identical critical regions.

(e) For the cigarette consumption data given in Table 3.2, run the following regression:

$$\log C = \alpha + \beta \log P + \gamma \log Y + u$$

compute the W, LR, LM given in part (c) for the null hypothesis $H_0$; $\beta = -1$.

(f) Compute the Wald statistics for $H_0^A$; $\beta = -1$, $H_0^B$; $\beta^5 = -1$ and $H_0^C$; $\beta^{-5} = -1$. How do
these statistics compare?

19. Gregory and Veall (1985). Using equation (7.51) and the two formulations of the null hypothesis
$H^A$ and $H^B$ given below (7.50), verify that the Wald statistics corresponding to these two
formulations are those given in (7.52) and (7.53), respectively.

20. Gregory and Veall (1986). Consider the dynamic equation

$$y_t = \rho y_{t-1} + \beta_1 x_t + \beta_2 x_{t-1} + u_t$$

where $|\rho| < 1$, and $u_t \sim$ NID$(0, \sigma^2)$. Note that for this equation to be the Cochrane-Orcutt
transformation

$$y_t - \rho y_{t-1} = \beta_1(x_t - \rho x_{t-1}) + u_t$$

the following nonlinear restriction must be satisfied $-\beta_1\rho = \beta_2$ called the common factor
restriction by Hendry and Mizon (1978). Now consider the following four formulations of this restriction

$H^A$; $\beta_1\rho + \beta_2 = 0$; $H^B$; $\beta_1 + (\beta_2/\rho) = 0$; $H^C$; $\rho + (\beta_2/\beta_1) = 0$ and $H^D$; $(\beta_1\rho/\beta_2) + 1 = 0$.

(a) Using equation (7.51) derive the four Wald statistics corresponding to the four formulations
of the null hypothesis.

(b) Apply these four Wald statistics to the equation relating real personal consumption
expenditures to real disposable personal income in the U.S. over the post World War II period
1959-2007, see Table 5.3.

21. Effect of Additional Regressors on $R^2$. This problem was considered in non-matrix form in Chapter
4, problem 4. Regress $y$ on $X_1$ which is $T \times K_1$ and compute $SSE_1$. Add $X_2$ which is $T \times K_2$ so
that the number of regressors is now $K = K_1 + K_2$. Regress $y$ on $X = [X_1, X_2]$ and get $SSE_2$.
Show that $SSE_2 \leq SSE_1$. Conclude that the corresponding R-squares satisfy $R_2^2 \geq R_1^2$. Hint:
Show that $P_X - P_{X_1}$ is a positive semi-definite matrix.

Appendix

Some  Useful  Matrix  Properties 


This  book  assumes  that  the  reader  has  encountered  matrices  before,  and  knows  how  to  add,  subtract 
and multiply conformable matrices. In addition, it assumes that the reader is familiar with the transpose,
trace, rank, determinant and inverse of a matrix. Unfamiliar readers should consult standard texts like Bellman
(1970)  or  Searle  (1982).  The  purpose  of  this  Appendix  is  to  review  some  useful  matrix  properties  that  are 
used  in  the  text  and  provide  easy  access  to  these  properties.  Most  of  these  properties  are  given  without 
proof. 

Starting with Chapter 7, our data matrix $X$ is organized such that it has $n$ rows and $k$ columns, so
that each row denotes an observation on $k$ variables and each column denotes $n$ observations on one
variable. This matrix is of dimension $n \times k$. The rank of an $n \times k$ matrix is always less than or equal
to its smaller dimension. Since $n > k$, the rank$(X) \leq k$. When there is no perfect multicollinearity
among the variables in $X$, this matrix is said to be of full column rank $k$. In this case, $X'X$, the matrix
of cross-products is of dimension $k \times k$. It is square, symmetric and of full rank $k$. This uses the fact
that the rank$(X'X)$ = rank$(X)$ = $k$. Therefore, $(X'X)$ is nonsingular and the inverse $(X'X)^{-1}$ exists.
This is needed for the computation of Ordinary Least Squares. In fact, for least squares to be feasible,
$X$ should be of full column rank $k$ and no variable in $X$ should be a perfect linear combination of the
other variables in $X$. If we write

$$X = \begin{bmatrix} x_1' \\ \vdots \\ x_n' \end{bmatrix}$$

where $x_i'$ denotes the i-th observation in the data, then $X'X = \sum_{i=1}^n x_i x_i'$ where $x_i$ is a column vector
of dimension $k \times 1$.

An important and widely encountered matrix is the Identity matrix which will be denoted by $I_n$ and
subscripted by its dimension $n$. This is a square $n \times n$ matrix whose diagonal elements are all equal to one
and its off diagonal elements are all equal to zero. Also, $\sigma^2 I_n$ will be a familiar scalar covariance matrix,
with every diagonal element equal to $\sigma^2$ reflecting homoskedasticity or equal variances (see Chapter 5),
and zero covariances or no serial correlation (see Chapter 5). Let

$$\Omega = \mathrm{diag}[\sigma_i^2] = \begin{bmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{bmatrix}$$

be an $(n \times n)$ diagonal matrix with the i-th diagonal element equal to $\sigma_i^2$ for $i = 1, 2, \ldots, n$. This matrix
will be encountered under heteroskedasticity, see Chapter 9. Note that $\mathrm{tr}(\Omega) = \sum_{i=1}^n \sigma_i^2$ is the sum of its
diagonal elements. Also, $\mathrm{tr}(I_n) = n$ and $\mathrm{tr}(\sigma^2 I_n) = n\sigma^2$. Another useful matrix is the projection matrix
$P_X = X(X'X)^{-1}X'$ which is of dimension $n \times n$. This matrix is encountered in Chapter 7. If $y$ denotes
the $n \times 1$ vector of observations on the dependent variable, then $P_X y$ generates the predicted values $\hat{y}$
from the least squares regression of $y$ on $X$. This matrix $P_X$ is symmetric and idempotent. This means
that $P_X' = P_X$ and $P_X^2 = P_X P_X = P_X$ as can be easily verified. One of the properties of idempotent
matrices is that their rank is equal to their trace. Hence, rank$(P_X) = \mathrm{tr}(P_X) = \mathrm{tr}[X(X'X)^{-1}X'] =
\mathrm{tr}[X'X(X'X)^{-1}] = \mathrm{tr}(I_k) = k$.

Here,  we  used  the  fact  that  tr  (ABC)  =  tr  (CAB)  =  tr  (BCA).  In  other  words,  the  trace  is  unaffected  by 
the  cyclical  permutation  of  the  product.  Of  course,  these  matrices  should  be  conformable  and  the  product 
should result in a square matrix. Note that $\bar{P}_X = I_n - P_X$ is also a symmetric and idempotent matrix.
In this case, $\bar{P}_X y = y - P_X y = y - \hat{y} = e$ where $e$ denotes the least squares residuals, $y - X\hat\beta_{OLS}$ where
$\hat\beta_{OLS} = (X'X)^{-1}X'y$, see Chapter 7. Some properties of these projection matrices are the following:

$P_X X = X$, $\bar{P}_X X = 0$, $\bar{P}_X e = e$ and $P_X e = 0$.

In fact, $X'e = 0$ means that the matrix $X$ is orthogonal to the vector of least squares residuals $e$. Note
that $X'e = 0$ means that $X'(y - X\hat\beta_{OLS}) = 0$ or $X'y = X'X\hat\beta_{OLS}$. These $k$ equations are known as the
OLS normal equations and their solution yields the least squares estimates $\hat\beta_{OLS}$. By the definition of
$\bar{P}_X$, we have (i) $P_X + \bar{P}_X = I_n$. Also, (ii) $P_X$ and $\bar{P}_X$ are idempotent and (iii) $P_X \bar{P}_X = 0$. In fact, any
two of these properties imply the third. The rank$(\bar{P}_X) = \mathrm{tr}(\bar{P}_X) = \mathrm{tr}(I_n - P_X) = n - k$. Note that $P_X$
and $\bar{P}_X$ are of rank $k$ and $(n - k)$, respectively. Neither matrix is of full rank. In fact, the
only full rank, symmetric idempotent matrix is the identity matrix.
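
The properties of the two projection matrices listed above are easy to confirm numerically. The following sketch uses a small simulated regressor matrix; the variable names (`P` for the projection matrix, `M` for its complement) and the dimensions are assumptions chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto the columns of X
M = np.eye(n) - P                        # its complement, I_n minus P
e = M @ y                                # least squares residuals

checks = {
    "P symmetric": np.allclose(P, P.T),
    "P idempotent": np.allclose(P @ P, P),
    "M idempotent": np.allclose(M @ M, M),
    "P M = 0": np.allclose(P @ M, 0.0),
    "P X = X": np.allclose(P @ X, X),
    "M X = 0": np.allclose(M @ X, 0.0),
    "X'e = 0": np.allclose(X.T @ e, 0.0),
    "tr(P) = k": np.isclose(np.trace(P), k),
    "tr(M) = n - k": np.isclose(np.trace(M), n - k),
}
for name, ok in checks.items():
    print(name, ok)
```

Each check should print `True`, up to the floating-point tolerance used by `np.allclose` and `np.isclose`.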

Matrices not of full rank are singular, and their inverses do not exist. However, one can find a generalized
inverse of a matrix $\Omega$ which we will call $\Omega^-$ which satisfies the following requirements:

(i) $\Omega\Omega^-\Omega = \Omega$  (ii) $\Omega^-\Omega\Omega^- = \Omega^-$

(iii) $\Omega^-\Omega$ is symmetric and (iv) $\Omega\Omega^-$ is symmetric.

Even if $\Omega$ is not square, a unique $\Omega^-$ can be found for $\Omega$ which satisfies the above four properties. This
is called the Moore-Penrose generalized inverse.

Note that a symmetric idempotent matrix is its own Moore-Penrose generalized inverse. For example, it
is easy to verify that if $\Omega = P_X$, then $\Omega^- = P_X$ and that it satisfies the above four properties. Idempotent
matrices have characteristic roots that are either zero or one. The number of non-zero characteristic roots
is equal to the rank of this matrix. The characteristic roots of $\Omega^{-1}$ are the reciprocals of the characteristic
roots of $\Omega$, but the characteristic vectors of both matrices are the same.
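
The claim that a symmetric idempotent matrix is its own Moore-Penrose generalized inverse can be illustrated with NumPy's `np.linalg.pinv`. In the sketch below the regressor matrix is simulated, and the cutoff `rcond=1e-10` is an assumption made so that the numerically zero singular values of the projection matrix are treated as exact zeros.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 3))
P = X @ np.linalg.solve(X.T @ X, X.T)       # symmetric idempotent, rank 3

P_ginv = np.linalg.pinv(P, rcond=1e-10)     # Moore-Penrose generalized inverse

penrose = {
    "(i)": np.allclose(P @ P_ginv @ P, P),
    "(ii)": np.allclose(P_ginv @ P @ P_ginv, P_ginv),
    "(iii)": np.allclose(P_ginv @ P, (P_ginv @ P).T),
    "(iv)": np.allclose(P @ P_ginv, (P @ P_ginv).T),
}
print(penrose, np.allclose(P_ginv, P))

roots = np.sort(np.linalg.eigvalsh(P))      # characteristic roots, ascending
print(np.round(roots, 8))
```

The characteristic roots should come out as five zeros and three ones, matching the rank of the projection matrix.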

The  determinant  of  a  matrix  is  non-zero  if  and  only  if  it  has  full  rank.  Therefore,  if  A  is  singular, 
then  |A|  =  0.  Also,  the  determinant  of  a  matrix  is  equal  to  the  product  of  its  characteristic  roots. 
For  two  square  matrices  A  and  B1  the  determinant  of  the  product  is  the  product  of  the  determinants 
\AB\  =  |A|  •  \B\.  Therefore,  the  determinant  of  H”1  is  the  reciprocal  of  the  determinant  of  fb  This  follows 
from  the  fact  that  |n||fl-1|  =  IOH”1!  =  j/|  =  1.  This  property  is  used  in  writing  the  likelihood  function 
for  Generalized  Least  Squares  (GLS)  estimation,  see  Chapter  9.  The  determinant  of  a  triangular  matrix 
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is  equal  to  the  product  of  its  diagonal  elements.  Of  course,  it  immediately  follows  that  the  determinant 
of  a  diagonal  matrix  is  the  product  of  its  diagonal  elements. 

The constant in the regression corresponds to a vector of ones in the matrix of regressors $X$. This vector of ones is denoted by $\iota_n$ where $n$ is the dimension of this column vector. Note that $\iota_n'\iota_n = n$ and $\iota_n\iota_n' = J_n$ where $J_n$ is a matrix of ones of dimension $n \times n$. Note also that $J_n$ is not idempotent, but $\bar{J}_n = J_n/n$ is idempotent as can be easily verified. The rank$(\bar{J}_n)$ = tr$(\bar{J}_n)$ = 1. Note also that $I_n - \bar{J}_n$ is idempotent with rank $(n-1)$. $\bar{J}_n y$ has a typical element $\bar{y} = \sum_{i=1}^n y_i/n$ whereas $(I_n - \bar{J}_n)y$ has a typical element $(y_i - \bar{y})$. So that $\bar{J}_n$ is the averaging matrix, whereas premultiplying by $(I_n - \bar{J}_n)$ results in deviations from the mean.
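The averaging and deviation matrices can be checked on a small vector. The following is a minimal sketch, not part of the original text: the data vector `y` is made up, and the helper names `matmul` and `matvec` are hypothetical conveniences written with plain Python lists.

```python
# Sketch: the averaging matrix J_bar = J_n / n and the deviations-from-mean
# matrix I_n - J_bar, stored as plain nested lists (illustrative data only).

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    """Multiply a matrix by a vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in a]

n = 4
y = [2.0, 4.0, 6.0, 12.0]                      # made-up data, sample mean 6
J_bar = [[1.0 / n] * n for _ in range(n)]      # every element equals 1/n
M = [[(1.0 if i == j else 0.0) - J_bar[i][j] for j in range(n)]
     for i in range(n)]                        # I_n - J_bar

means = matvec(J_bar, y)                       # each element is the sample mean
deviations = matvec(M, y)                      # each element is y_i minus the mean
trace_J_bar = sum(J_bar[i][i] for i in range(n))   # trace of J_bar (its rank)
trace_M = sum(M[i][i] for i in range(n))           # trace of I_n - J_bar (its rank)
J_bar_squared = matmul(J_bar, J_bar)           # should reproduce J_bar
```

For an idempotent matrix the trace equals the rank, which is why the two traces above come out as 1 and $n - 1$.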

For  two  nonsingular  matrices  A  and  B 

$(AB)^{-1} = B^{-1}A^{-1}$

Also, the transpose of a product of two conformable matrices is $(AB)' = B'A'$. In fact, for the product of three conformable matrices this becomes $(ABC)' = C'B'A'$. The transpose of the inverse is the inverse of the transpose, i.e., $(A^{-1})' = (A')^{-1}$.

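The inverse of a partitioned matrix

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$

is

$$A^{-1} = \begin{bmatrix} E & -EA_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{21}E & A_{22}^{-1} + A_{22}^{-1}A_{21}EA_{12}A_{22}^{-1} \end{bmatrix}$$

where $E = (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}$. Alternatively, it can be expressed as

$$A^{-1} = \begin{bmatrix} A_{11}^{-1} + A_{11}^{-1}A_{12}FA_{21}A_{11}^{-1} & -A_{11}^{-1}A_{12}F \\ -FA_{21}A_{11}^{-1} & F \end{bmatrix}$$

where $F = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}$. These formulas are used in partitioned regression models, see for example the Frisch-Waugh-Lovell Theorem and the computation of the variance-covariance matrix of forecasts from a multiple regression in Chapter 7.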
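As a quick sanity check on the two partitioned-inverse formulas, the sketch below applies them to the simplest possible case, a $2 \times 2$ matrix whose four blocks are scalars. The numbers are made up for illustration, and exact rational arithmetic is used so the comparison is exact; this is only a special case, not a general implementation.

```python
from fractions import Fraction as Fr

# Illustrative 2 x 2 matrix whose four "blocks" are scalars (made-up values).
A11, A12, A21, A22 = Fr(4), Fr(2), Fr(1), Fr(3)

# First form, built around E = (A11 - A12 A22^{-1} A21)^{-1}.
E = 1 / (A11 - A12 / A22 * A21)
inverse_via_E = [
    [E, -E * A12 / A22],
    [-A21 * E / A22, 1 / A22 + A21 * E * A12 / A22 ** 2],
]

# Second form, built around F = (A22 - A21 A11^{-1} A12)^{-1}.
F = 1 / (A22 - A21 / A11 * A12)
inverse_via_F = [
    [1 / A11 + A12 * F * A21 / A11 ** 2, -A12 * F / A11],
    [-F * A21 / A11, F],
]

# Direct inverse of a 2 x 2 matrix, for comparison.
det = A11 * A22 - A12 * A21
direct_inverse = [[A22 / det, -A12 / det], [-A21 / det, A11 / det]]
```

Both forms reproduce the direct inverse; with matrix-valued blocks the same expressions hold provided the order of multiplication is respected.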

An $n \times n$ symmetric matrix $\Omega$ has $n$ distinct characteristic vectors $c_1, \ldots, c_n$. The corresponding $n$ characteristic roots $\lambda_1, \ldots, \lambda_n$ may not be distinct but they are all real numbers. The number of non-zero characteristic roots of $\Omega$ is equal to the rank of $\Omega$. The characteristic roots of a positive definite matrix are positive. The characteristic vectors of the symmetric matrix $\Omega$ are orthogonal to each other, i.e., $c_i'c_j = 0$ for $i \neq j$, and can be made orthonormal with $c_i'c_i = 1$ for $i = 1, 2, \ldots, n$. Hence, the matrix of characteristic vectors $C = [c_1, c_2, \ldots, c_n]$ is an orthogonal matrix, such that $CC' = C'C = I_n$ with $C' = C^{-1}$. By definition $\Omega c_i = \lambda_i c_i$ or $\Omega C = C\Lambda$ where $\Lambda = \text{diag}[\lambda_i]$. Premultiplying the last equation by $C'$ we get $C'\Omega C = C'C\Lambda = \Lambda$. Therefore, the matrix of characteristic vectors $C$ diagonalizes the symmetric matrix $\Omega$. Alternatively, we can write $\Omega = C\Lambda C' = \sum_{i=1}^n \lambda_i c_i c_i'$ which is the spectral decomposition of $\Omega$.

A real symmetric $n \times n$ matrix $\Omega$ is positive semi-definite if for every $n \times 1$ vector $y$, we have $y'\Omega y \geq 0$. If $y'\Omega y$ is strictly positive for any non-zero $y$ then $\Omega$ is said to be positive definite. A necessary and sufficient condition for $\Omega$ to be positive definite is that all the characteristic roots of $\Omega$ are positive. One important application is the comparison of efficiency of two unbiased estimators of a vector of parameters $\beta$. In this case, we subtract the variance-covariance matrix of the more efficient estimator from that of the less efficient one and show that the resulting difference yields a positive semi-definite matrix, see the Gauss-Markov Theorem in Chapter 7.

If $\Omega$ is a symmetric and positive definite matrix, there exists a nonsingular matrix $P$ such that $\Omega = PP'$. In fact, using the spectral decomposition of $\Omega$ given above, one choice for $P$ is $P = C\Lambda^{1/2}$ so that $\Omega = C\Lambda C' = PP'$. This is a useful result which we use in Chapter 9 to obtain Generalized Least Squares (GLS) as a least squares regression after transforming the original regression model by $P^{-1} = (C\Lambda^{1/2})^{-1} = \Lambda^{-1/2}C'$. In fact, if $u \sim (0, \sigma^2\Omega)$, then $P^{-1}u$ has zero mean and $\text{var}(P^{-1}u) = P^{-1}\text{var}(u)P^{-1\prime} = \sigma^2 P^{-1}\Omega P^{-1\prime} = \sigma^2 P^{-1}PP'P^{-1\prime} = \sigma^2 I_n$.

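From Chapter 2, we have seen that if $u \sim N(0, \sigma^2 I_n)$, then $u_i/\sigma \sim N(0, 1)$, so that $u_i^2/\sigma^2 \sim \chi^2_1$ and $u'u/\sigma^2 = \sum_{i=1}^n u_i^2/\sigma^2 \sim \chi^2_n$. Therefore, $u'(\sigma^2 I_n)^{-1}u \sim \chi^2_n$. If $u \sim N(0, \sigma^2\Omega)$ where $\Omega$ is positive definite, then $u^* = P^{-1}u \sim N(0, \sigma^2 I_n)$ and $u^{*\prime}u^*/\sigma^2 \sim \chi^2_n$. But $u^{*\prime}u^* = u'P^{-1\prime}P^{-1}u = u'\Omega^{-1}u$. Hence, $u'\Omega^{-1}u/\sigma^2 \sim \chi^2_n$. This is used in Chapter 9.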

Note that the OLS residuals are denoted by $e = \bar{P}_X u$, where $\bar{P}_X = I_n - P_X$. If $u \sim N(0, \sigma^2 I_n)$, then $e$ has mean zero and $\text{var}(e) = \sigma^2\bar{P}_X I_n \bar{P}_X = \sigma^2\bar{P}_X$ so that $e \sim N(0, \sigma^2\bar{P}_X)$. Our estimator of $\sigma^2$ in Chapter 7 is $s^2 = e'e/(n-k)$ so that $(n-k)s^2/\sigma^2 = e'e/\sigma^2$. The last term can also be written as $u'\bar{P}_X u/\sigma^2$. In order to find the distribution of this quadratic form in Normal variables, we use the following result stated as lemma 1 in Chapter 7.

Lemma  1:  For  every  symmetric  idempotent  matrix  A  of  rank  r,  there  exists  an  orthogonal  matrix  P 
such  that  P' AP  =  Jr  where  Jr  is  a  diagonal  matrix  with  the  first  r  elements  equal  to  one  and  the  rest 
equal  to  zero. 

We use this lemma to show that $e'e/\sigma^2$ is a chi-squared with $(n-k)$ degrees of freedom. To see this note that $e'e/\sigma^2 = u'\bar{P}_X u/\sigma^2$ and that $\bar{P}_X$ is symmetric and idempotent of rank $(n-k)$. Using the lemma there exists a matrix $P$ such that $P'\bar{P}_X P = J_{n-k}$ is a diagonal matrix with the first $(n-k)$ elements on the diagonal equal to 1 and the last $k$ elements equal to zero. An orthogonal matrix $P$ is by definition a matrix whose inverse is its own transpose, i.e., $P'P = I_n$. Let $v = P'u$, then $v$ has mean zero and $\text{var}(v) = \sigma^2 P'P = \sigma^2 I_n$ so that $v$ is $N(0, \sigma^2 I_n)$ and $u = Pv$. Therefore,

$$e'e/\sigma^2 = u'\bar{P}_X u/\sigma^2 = v'P'\bar{P}_X Pv/\sigma^2 = v'J_{n-k}v/\sigma^2 = \sum_{i=1}^{n-k} v_i^2/\sigma^2$$

But, the $v$'s are independent identically distributed $N(0, \sigma^2)$, hence $v_i^2/\sigma^2$ is the square of a standardized $N(0, 1)$ random variable which is distributed as a $\chi^2_1$. Moreover, the sum of independent $\chi^2$ random variables is a $\chi^2$ random variable with degrees of freedom equal to the sum of the respective degrees of freedom, see Chapter 2. Hence, $e'e/\sigma^2$ is distributed as $\chi^2_{n-k}$.

The beauty of the above result is that it applies to all quadratic forms $u'Au$ where $A$ is symmetric and idempotent. In general, for $u \sim N(0, \sigma^2 I)$, a necessary and sufficient condition for $u'Au/\sigma^2$ to be distributed $\chi^2_k$ is that $A$ is idempotent of rank $k$, see Theorem 4.6 of Graybill (1961). Another useful theorem on quadratic forms in normal random variables is the following: If $u \sim N(0, \sigma^2\Omega)$, then $u'Au/\sigma^2$ is $\chi^2_k$ if and only if $A\Omega$ is an idempotent matrix of rank $k$, see Theorem 4.8 of Graybill (1961). If $u \sim N(0, \sigma^2 I)$, the two positive semi-definite quadratic forms in normal random variables, say $u'Au$ and $u'Bu$, are independent if and only if $AB = 0$, see Theorem 4.10 of Graybill (1961). A sufficient condition is that $\text{tr}(AB) = 0$, see Theorem 4.15 of Graybill (1961). This is used in Chapter 7 to construct F-statistics to test hypotheses, see for example problem 11. For $u \sim N(0, \sigma^2 I)$, the quadratic form $u'Au$ is independent of the linear form $Bu$ if $BA = 0$, see Theorem 4.17 of Graybill (1961). This is used in Chapter 7 to prove the independence of $s^2$ and $\hat\beta_{OLS}$, see problem 8. In general, if $u \sim N(0, \Sigma)$, then $u'Au$ and $u'Bu$ are independent if and only if $A\Sigma B = 0$, see Theorem 4.21 of Graybill (1961). Many other useful matrix properties can be found. This is only a sample of them that will be implicitly or explicitly used in this book.

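The Kronecker product of two matrices, say $\Sigma \otimes I_n$ where $\Sigma$ is $m \times m$ and $I_n$ is the identity matrix of dimension $n$, is defined as follows:

$$\Sigma \otimes I_n = \begin{bmatrix} \sigma_{11}I_n & \cdots & \sigma_{1m}I_n \\ \vdots & & \vdots \\ \sigma_{m1}I_n & \cdots & \sigma_{mm}I_n \end{bmatrix}$$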

In other words, we place an $I_n$ next to every element of $\Sigma = [\sigma_{ij}]$. The dimension of the resulting matrix is $mn \times mn$. This is useful when we have a system of equations like Seemingly Unrelated Regressions in Chapter 10. In general, if $A$ is $m \times n$ and $B$ is $p \times q$ then $A \otimes B$ is $mp \times nq$. Some properties of Kronecker products include $(A \otimes B)' = A' \otimes B'$. If both $A$ and $B$ are square matrices of order $m \times m$ and $p \times p$ then $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$, $|A \otimes B| = |A|^p|B|^m$ and $\text{tr}(A \otimes B) = \text{tr}(A)\text{tr}(B)$. Applying this result to $\Sigma \otimes I_n$ we get

$(\Sigma \otimes I_n)^{-1} = \Sigma^{-1} \otimes I_n$ and $|\Sigma \otimes I_n| = |\Sigma|^n|I_n|^m = |\Sigma|^n$

and $\text{tr}(\Sigma \otimes I_n) = \text{tr}(\Sigma)\text{tr}(I_n) = n\,\text{tr}(\Sigma)$.
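The determinant, trace and transpose rules for Kronecker products are easy to confirm on small integer matrices. The sketch below is illustrative only: the matrices are made up, and `kron`, `det`, `trace` and `transpose` are hypothetical helper names written here for the purpose (the determinant uses a naive Laplace expansion, which is adequate only for tiny matrices). For square $A$ ($m \times m$) and $B$ ($p \times p$) the determinant of $A \otimes B$ is $|A|^p|B|^m$.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    p, q = len(b), len(b[0])
    return [[a[i // p][j // q] * b[i % p][j % q]
             for j in range(len(a[0]) * q)]
            for i in range(len(a) * p)]

def det(m):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j in range(len(m)):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

def transpose(m):
    return [list(col) for col in zip(*m)]

sigma = [[2, 1], [1, 3]]                       # m = 2, determinant 5, trace 5
identity3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # n = 3
B = [[1, 0, 0], [0, 2, 0], [0, 1, 1]]          # p = 3, determinant 2

sigma_kron_I = kron(sigma, identity3)          # 6 x 6 matrix
det_sigma_kron_I = det(sigma_kron_I)           # |sigma|^n = 5^3
trace_sigma_kron_I = trace(sigma_kron_I)       # n * tr(sigma) = 3 * 5
det_sigma_kron_B = det(kron(sigma, B))         # |sigma|^p * |B|^m = 5^3 * 2^2
```

Because the entries are integers, all of these comparisons are exact rather than approximate.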

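Some useful properties of matrix differentiation are the following:

$$\frac{\partial x'b}{\partial b} = x \quad \text{where } x' \text{ is } 1 \times k \text{ and } b \text{ is } k \times 1.$$

Also

$$\frac{\partial b'Ab}{\partial b} = (A + A')b \quad \text{where } A \text{ is } k \times k.$$

If $A$ is symmetric, then $\partial b'Ab/\partial b = 2Ab$. These two properties will be used in Chapter 7 in deriving the least squares estimator.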
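The two differentiation rules can be checked against central finite differences. This is a minimal sketch under made-up inputs: the matrix `A`, the vectors `x` and `b`, and the helper `numerical_gradient` are all hypothetical and chosen only for illustration. A deliberately non-symmetric `A` is used so that $(A + A')b$ differs from $2Ab$.

```python
def numerical_gradient(f, b, step=1e-6):
    """Central-difference approximation to the gradient of f at b."""
    grad = []
    for j in range(len(b)):
        up = b[:j] + [b[j] + step] + b[j + 1:]
        down = b[:j] + [b[j] - step] + b[j + 1:]
        grad.append((f(up) - f(down)) / (2 * step))
    return grad

def linear_form(x, b):
    """x'b"""
    return sum(xj * bj for xj, bj in zip(x, b))

def quadratic_form(A, b):
    """b'Ab"""
    k = len(b)
    return sum(b[i] * A[i][j] * b[j] for i in range(k) for j in range(k))

A = [[1.0, 2.0], [0.0, 3.0]]      # made-up, not symmetric
x = [3.0, -1.0]
b = [1.0, 2.0]

# Analytic gradients from the two rules in the text.
grad_linear_analytic = x
grad_quadratic_analytic = [sum((A[i][j] + A[j][i]) * b[j] for j in range(2))
                           for i in range(2)]          # (A + A')b

# Numerical counterparts.
grad_linear_numeric = numerical_gradient(lambda v: linear_form(x, v), b)
grad_quadratic_numeric = numerical_gradient(lambda v: quadratic_form(A, v), b)
```

In the least squares derivation the matrix in the quadratic form is $X'X$, which is symmetric, so the simpler $2Ab$ form applies there.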
CHAPTER 8


Regression  Diagnostics  and  Specification  Tests 

8.1 Influential Observations

Sources  of  influential  observations  include:  (i)  improperly  recorded  data,  (ii)  observational  errors 
in  the  data,  (iii)  misspecification  and  (iv)  outlying  data  points  that  are  legitimate  and  contain 
valuable  information  which  improve  the  efficiency  of  the  estimation.  It  is  constructive  to  isolate 
extreme  points  and  to  determine  the  extent  to  which  the  parameter  estimates  depend  upon 
these  desirable  data. 

One  should  always  run  descriptive  statistics  on  the  data,  see  Chapter  2.  This  will  often  reveal 
outliers,  skewness  or  multimodal  distributions.  Scatter  diagrams  should  also  be  examined,  but 
these  diagnostics  are  only  the  first  line  of  attack  and  are  inadequate  in  detecting  multivariate 
discrepant  observations  or  the  way  each  observation  affects  the  estimated  regression  model. 

In regression analysis, we emphasize the importance of plotting the residuals against the explanatory variables or the predicted values $\hat{y}$ to identify patterns in these residuals that may indicate nonlinearity, heteroskedasticity, serial correlation, etc., see Chapter 3. In this section, we learn how to identify significantly large residuals and compute regression diagnostics that may identify influential observations. We study the extent to which the deletion of any observation affects the estimated coefficients, the standard errors, predicted values, residuals and test statistics. These represent the core of diagnostic tools in regression analysis.

Accordingly,  Belsley,  Kuh  and  Welsch  (1980,  p.  11)  define  an  influential  observation  as  “..one 
which,  either  individually  or  together  with  several  other  observations,  has  demonstrably  larger 
impact on the calculated values of various estimates (coefficients, standard errors, t-values, etc.)
than  is  the  case  for  most  of  the  other  observations.” 

First, what is a significantly large residual? We have seen that the least squares residuals of $y$ on $X$ are given by $e = (I_n - P_X)u$, see equation (7.7). $y$ is $n \times 1$ and $X$ is $n \times k$. If $u \sim \text{IID}(0, \sigma^2 I_n)$, then $e$ has zero mean and variance $\sigma^2(I_n - P_X)$. Therefore, the OLS residuals are correlated and heteroskedastic with $\text{var}(e_i) = \sigma^2(1 - h_{ii})$ where $h_{ii}$ is the $i$-th diagonal element of the hat matrix $H = P_X$, since $\hat{y} = Hy$.

The diagonal elements $h_{ii}$ have the following properties:

$\sum_{i=1}^n h_{ii} = \text{tr}(P_X) = k$ and $h_{ii} = \sum_{j=1}^n h_{ij}^2 \geq h_{ii}^2 \geq 0$.

The last property follows from the fact that $P_X$ is symmetric and idempotent. Therefore, $h_{ii}^2 - h_{ii} \leq 0$ or $h_{ii}(h_{ii} - 1) \leq 0$. Hence, $0 \leq h_{ii} \leq 1$ (see problem 1). $h_{ii}$ is called the leverage of the $i$-th observation. For a simple regression with a constant,

$h_{ii} = (1/n) + (x_i^2/\sum_{j=1}^n x_j^2)$

where $x_i = X_i - \bar{X}$. $h_{ii}$ can be interpreted as a measure of the distance between the $X$ values of the $i$-th observation and their mean over all $n$ observations. A large $h_{ii}$ indicates that the $i$-th observation is distant from the center of the observations. This means that the $i$-th observation with large $h_{ii}$ (a function only of $X_i$ values) exercises substantial leverage in determining the fitted value $\hat{y}_i$. Also, the larger $h_{ii}$, the smaller the variance of the residual $e_i$. Since observations with high leverage tend to have smaller residuals, it may not be possible to detect them by an examination of the residuals alone. But, what is a large leverage? $h_{ii}$ is large if it is more than twice the mean leverage value $2\bar{h} = 2k/n$. Hence, observations with $h_{ii} \geq 2k/n$ are considered outlying with regard to $X$ values.
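The leverage properties above can be illustrated for a simple regression with a constant. The sketch below uses a made-up regressor with one distant point; the variable names are hypothetical. It computes $h_{ii}$ both from the simple-regression formula and from $x_i'(X'X)^{-1}x_i$ with $X = [\iota_n, X]$, then applies the $2k/n$ rule.

```python
# Made-up regressor values; the last observation is far from the others.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
n, k = len(x), 2                      # constant plus one slope

# Simple-regression formula: h_ii = 1/n + (X_i - X_bar)^2 / sum_j (X_j - X_bar)^2
x_bar = sum(x) / n
dev = [xi - x_bar for xi in x]
sxx = sum(d * d for d in dev)
leverage = [1.0 / n + d * d / sxx for d in dev]

# General formula x_i'(X'X)^{-1} x_i, using the closed-form 2 x 2 inverse.
sum_x = sum(x)
sum_x2 = sum(xi * xi for xi in x)
determinant = n * sum_x2 - sum_x ** 2
xtx_inv = [[sum_x2 / determinant, -sum_x / determinant],
           [-sum_x / determinant, n / determinant]]
leverage_general = [xtx_inv[0][0] + 2 * xtx_inv[0][1] * xi + xtx_inv[1][1] * xi * xi
                    for xi in x]

# Rule of thumb: flag observations whose leverage reaches twice the mean leverage.
cutoff = 2.0 * k / n
flagged = [i for i, h in enumerate(leverage) if h >= cutoff]
```

With these values only the last observation is flagged, which matches the intuition that leverage depends on the distance of $X_i$ from the sample mean and not on $y_i$.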

An alternative representation of $h_{ii}$ is simply $h_{ii} = d_i'P_X d_i = \|P_X d_i\|^2 = x_i'(X'X)^{-1}x_i$ where $d_i$ denotes the $i$-th observation's dummy variable, i.e., a vector of dimension $n$ with 1 in the $i$-th position and 0 elsewhere. $x_i'$ is the $i$-th row of $X$ and $\|.\|$ denotes the Euclidean length. Note that $d_i'X = x_i'$.

Let us standardize the $i$-th OLS residual by dividing it by an estimate of its standard deviation. A standardized residual would then be:

$$\tilde{e}_i = \frac{e_i}{s\sqrt{1 - h_{ii}}} \qquad (8.1)$$

where $\sigma^2$ is estimated by $s^2$, the MSE of the regression. This is an internal studentization of the residuals, see Cook and Weisberg (1982). Alternatively, one could use an estimate of $\sigma^2$ that is independent of $e_i$. Defining $s^2_{(i)}$ as the MSE from the regression computed without the $i$-th observation, it can be shown, see equation (8.18) below, that

$$s^2_{(i)} = \frac{(n-k)s^2 - e_i^2/(1 - h_{ii})}{n-k-1} = s^2\left(\frac{n-k-\tilde{e}_i^2}{n-k-1}\right) \qquad (8.2)$$

Under normality, $s^2_{(i)}$ and $e_i$ are independent and the externally studentized residuals are defined by

$$e_i^* = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}} \sim t_{n-k-1} \qquad (8.3)$$


Thus, if the normality assumption holds, we can readily assess the significance of any single studentized residual. Of course, the $e_i^*$ will not be independent. Since this is a t-statistic, it is natural to think of $e_i^*$ as large if its value exceeds 2 in absolute value.

Substituting (8.2) into (8.3) and comparing the result with (8.1), it is easy to show that $e_i^*$ is a monotonic transformation of $\tilde{e}_i$:

$$e_i^* = \tilde{e}_i\left(\frac{n-k-1}{n-k-\tilde{e}_i^2}\right)^{1/2} \qquad (8.4)$$
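Equations (8.1) to (8.4) can be verified numerically on a toy simple regression. The following is only a sketch: the data are made up, `fit_line` is a hypothetical helper, and the deleted-observation MSE from (8.2) is compared with the value obtained by actually refitting the regression without that observation.

```python
import math

def fit_line(x, y):
    """OLS of y on a constant and x; returns (intercept, slope, residuals)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return intercept, slope, residuals

# Made-up data: five observations, constant plus one regressor (k = 2).
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 7.0]
n, k = len(x), 2

_, _, e = fit_line(x, y)
s2 = sum(r * r for r in e) / (n - k)                       # MSE of the full regression
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]        # leverages

i = 4                                                      # observation to study
e_tilde = e[i] / math.sqrt(s2 * (1 - h[i]))                # (8.1) internal
s2_deleted_formula = ((n - k) * s2 - e[i] ** 2 / (1 - h[i])) / (n - k - 1)   # (8.2)
e_star = e[i] / math.sqrt(s2_deleted_formula * (1 - h[i])) # (8.3) external
e_star_via_84 = e_tilde * math.sqrt((n - k - 1) / (n - k - e_tilde ** 2))    # (8.4)

# Brute force: drop observation i and refit.
_, _, e_deleted = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
s2_deleted_direct = sum(r * r for r in e_deleted) / (n - k - 1)
```

The point of (8.2) is that no refitting is needed: the deleted-observation MSE follows from quantities already available from the full regression.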


Cook and Weisberg (1982) show that $e_i^*$ can be obtained as a t-statistic from the following augmented regression:

$$y = X\beta^* + d_i\varphi + u \qquad (8.5)$$

where $d_i$ is the dummy variable for the $i$-th observation. In fact, $\hat\varphi = e_i/(1 - h_{ii})$ and $e_i^*$ is the t-statistic for testing that $\varphi = 0$ (see problem 4 and the proof given below). Hence, whether the $i$-th residual is large can be simply determined by the regression (8.5). A dummy variable for the $i$-th observation is included in the original regression and the t-statistic on this dummy tests whether this $i$-th residual is large. This is repeated for all observations $i = 1, \ldots, n$.

This can be generalized easily to testing for a group of significantly large residuals:

$$y = X\beta^* + D_p\varphi^* + u \qquad (8.6)$$

where $D_p$ is an $n \times p$ matrix of dummy variables for the $p$ suspected observations. One can test $\varphi^* = 0$ using the Chow test described in (4.17) as follows:

$$F = \frac{[\text{Residual SS(no dummies)} - \text{Residual SS}(D_p \text{ dummies used})]/p}{\text{Residual SS}(D_p \text{ dummies used})/(n-k-p)} \qquad (8.7)$$

This will be distributed as $F_{p,n-k-p}$ under the null, see Gentleman and Wilk (1975). Let

$$e_p = D_p'e, \text{ then } E(e_p) = 0 \text{ and } \text{var}(e_p) = \sigma^2 D_p'\bar{P}_X D_p \qquad (8.8)$$

Then one can show (see problem 5) that

$$F = \frac{[e_p'(D_p'\bar{P}_X D_p)^{-1}e_p]/p}{[(n-k)s^2 - e_p'(D_p'\bar{P}_X D_p)^{-1}e_p]/(n-k-p)} \sim F_{p,n-k-p} \qquad (8.9)$$

Another refinement comes from estimating the regression without the $i$-th observation:

$$\hat\beta_{(i)} = [X_{(i)}'X_{(i)}]^{-1}X_{(i)}'y_{(i)} \qquad (8.10)$$

where the $(i)$ subscript notation indicates that the $i$-th observation has been deleted. Using the updating formula

$$(A - a'b)^{-1} = A^{-1} + A^{-1}a'(I - bA^{-1}a')^{-1}bA^{-1} \qquad (8.11)$$

with $A = (X'X)$ and $a = b = x_i'$, one gets

$$[X_{(i)}'X_{(i)}]^{-1} = (X'X)^{-1} + (X'X)^{-1}x_i x_i'(X'X)^{-1}/(1 - h_{ii}) \qquad (8.12)$$

Therefore

$$\hat\beta - \hat\beta_{(i)} = (X'X)^{-1}x_i e_i/(1 - h_{ii}) \qquad (8.13)$$

Since the estimated coefficients are often of primary interest, (8.13) describes the change in the estimated regression coefficients that would occur if the $i$-th observation is deleted. Note that a high leverage observation with $h_{ii}$ large will be influential in (8.13) only if the corresponding residual $e_i$ is not small. Therefore, high leverage implies a potentially influential observation, but whether this potential is actually realized depends on $y_i$.
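The deletion formula (8.13) can be checked against a brute-force refit. The sketch below is illustrative only: the data are made up, `fit_line` is a hypothetical helper for a simple regression with a constant, and the closed-form $2 \times 2$ inverse of $X'X$ is used so that no linear-algebra library is needed.

```python
def fit_line(x, y):
    """OLS of y on a constant and x; returns (intercept, slope, residuals)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return intercept, slope, residuals

# Made-up data; the last observation has high leverage.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 7.0]
n = len(x)
i = 4                                           # observation to delete

intercept, slope, e = fit_line(x, y)

# (X'X)^{-1} for X = [1, x], written out in closed form.
sum_x, sum_x2 = sum(x), sum(xi * xi for xi in x)
determinant = n * sum_x2 - sum_x ** 2
xtx_inv = [[sum_x2 / determinant, -sum_x / determinant],
           [-sum_x / determinant, n / determinant]]

x_i = [1.0, x[i]]                               # i-th row of X
direction = [xtx_inv[r][0] * x_i[0] + xtx_inv[r][1] * x_i[1] for r in range(2)]
h_ii = x_i[0] * direction[0] + x_i[1] * direction[1]

# Equation (8.13): change in the coefficients when observation i is deleted.
change_formula = [d * e[i] / (1 - h_ii) for d in direction]

# Brute force: refit without observation i and take the difference.
intercept_del, slope_del, _ = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
change_direct = [intercept - intercept_del, slope - slope_del]
```

In this toy data the deleted point has $h_{ii} = 0.92$, so even a modest residual is scaled up by $1/(1 - h_{ii}) = 12.5$, and the slope moves noticeably.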

Alternatively, one can obtain this result from the augmented regression given in (8.5). Note that $P_{d_i} = d_i(d_i'd_i)^{-1}d_i' = d_i d_i'$ is an $n \times n$ matrix with 1 in the $i$-th diagonal position and 0 elsewhere. $\bar{P}_{d_i} = I_n - P_{d_i}$ has the effect, when post-multiplied by a vector $y$, of deleting the $i$-th observation. Hence, premultiplying (8.5) by $\bar{P}_{d_i}$ one gets

$$\bar{P}_{d_i}y = \begin{pmatrix} y_{(i)} \\ 0 \end{pmatrix} = \begin{pmatrix} X_{(i)} \\ 0 \end{pmatrix}\beta^* + \begin{pmatrix} u_{(i)} \\ 0 \end{pmatrix} \qquad (8.14)$$

where the $i$-th observation is moved to the bottom of the data, without loss of generality. The last observation has no effect on the least squares estimate of $\beta^*$ since both the dependent and independent variables are zero. This regression will yield $\hat\beta^* = \hat\beta_{(i)}$, and the $i$-th observation's residual is clearly zero. By the Frisch-Waugh-Lovell Theorem given in section 7.3, the least squares estimates and the residuals from (8.14) are numerically identical to those from (8.5).
Therefore, $\hat\beta^* = \hat\beta_{(i)}$ in (8.5) and the $i$-th observation residual from (8.5) must be zero. This implies that $\hat\varphi = y_i - x_i'\hat\beta_{(i)}$, and the fitted values from this regression are given by $\hat{y} = X\hat\beta_{(i)} + d_i\hat\varphi$ whereas those from the original regression (7.1) are given by $X\hat\beta$. The difference in residuals is therefore

$$e - e_{(i)} = X\hat\beta_{(i)} + d_i\hat\varphi - X\hat\beta \qquad (8.15)$$

Premultiplying (8.15) by $\bar{P}_X$ and using the fact that $\bar{P}_X X = 0$, one gets $\bar{P}_X(e - e_{(i)}) = \bar{P}_X d_i\hat\varphi$. But, $\bar{P}_X e = e$ and $\bar{P}_X e_{(i)} = e_{(i)}$, hence $\bar{P}_X d_i\hat\varphi = e - e_{(i)}$. Premultiplying both sides by $d_i'$ one gets $d_i'\bar{P}_X d_i\hat\varphi = e_i$ since the $i$-th residual from (8.5) is zero. By definition, $d_i'\bar{P}_X d_i = 1 - h_{ii}$, therefore

$$\hat\varphi = e_i/(1 - h_{ii}) \qquad (8.16)$$

Premultiplying (8.15) by $(X'X)^{-1}X'$ one gets $0 = \hat\beta_{(i)} - \hat\beta + (X'X)^{-1}X'd_i\hat\varphi$. This uses the fact that both residuals are orthogonal to $X$. Rearranging terms and substituting $\hat\varphi$ from (8.16), one gets

$$\hat\beta - \hat\beta_{(i)} = (X'X)^{-1}x_i\hat\varphi = (X'X)^{-1}x_i e_i/(1 - h_{ii})$$

as given in (8.13).

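Note that $s^2_{(i)}$ given in (8.2) can now be written in terms of $\hat\beta_{(i)}$:

$$s^2_{(i)} = \sum_{t \neq i}(y_t - x_t'\hat\beta_{(i)})^2/(n-k-1) \qquad (8.17)$$

Upon substituting (8.13) in (8.17) we get

$$(n-k-1)s^2_{(i)} = \sum_{t=1}^n\left(e_t + \frac{h_{it}e_i}{1 - h_{ii}}\right)^2 - \frac{e_i^2}{(1 - h_{ii})^2}$$

$$= (n-k)s^2 + \frac{2e_i}{1 - h_{ii}}\sum_{t=1}^n e_t h_{it} + \frac{e_i^2}{(1 - h_{ii})^2}\sum_{t=1}^n h_{it}^2 - \frac{e_i^2}{(1 - h_{ii})^2}$$

$$= (n-k)s^2 + \frac{h_{ii}e_i^2}{(1 - h_{ii})^2} - \frac{e_i^2}{(1 - h_{ii})^2} = (n-k)s^2 - \frac{e_i^2}{1 - h_{ii}} \qquad (8.18)$$

which is (8.2). This uses the fact that $He = 0$ and $H^2 = H$. Hence, $\sum_{t=1}^n e_t h_{it} = 0$ and $\sum_{t=1}^n h_{it}^2 = h_{ii}$.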

To assess whether the change in $\hat\beta_j$ (the $j$-th component of $\hat\beta$) that results from the deletion of the $i$-th observation is large or small, we scale by the standard error of $\hat\beta_j$, the square root of $\sigma^2(X'X)^{-1}_{jj}$. This is denoted by

$$DFBETAS_{ij} = (\hat\beta_j - \hat\beta_{j(i)})/s_{(i)}\sqrt{(X'X)^{-1}_{jj}} \qquad (8.19)$$

Note that $s_{(i)}$ is used in order to make the denominator stochastically independent of the numerator in the Gaussian case. Absolute values of DFBETAS larger than 2 are considered influential. However, Belsley, Kuh, and Welsch (1980) suggest $2/\sqrt{n}$ as a size-adjusted cutoff. In fact, it would be most unusual for the removal of a single observation from a sample of 100 or more to result in a change in any estimate by two or more standard errors. The size-adjusted cutoff tends to expose approximately the same proportion of potentially influential observations, regardless of sample size. The size-adjusted cutoff is particularly important for large data sets.

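In case of Normality, it can also be useful to look at the change in the t-statistics, as a means of assessing the sensitivity of the regression output to the deletion of the $i$-th observation:

$$DFSTAT_{ij} = \frac{\hat\beta_j}{s\sqrt{(X'X)^{-1}_{jj}}} - \frac{\hat\beta_{j(i)}}{s_{(i)}\sqrt{(X_{(i)}'X_{(i)})^{-1}_{jj}}} \qquad (8.20)$$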


Another way to summarize coefficient changes and gain insight into forecasting effects when the $i$-th observation is deleted is to look at the change in fit, defined as

$$DFFIT_i = \hat{y}_i - \hat{y}_{(i)} = x_i'[\hat\beta - \hat\beta_{(i)}] = h_{ii}e_i/(1 - h_{ii}) \qquad (8.21)$$

where the last equality is obtained from (8.13).

We scale this measure by the standard deviation of $\hat{y}_i$, i.e., $\sigma\sqrt{h_{ii}}$, giving

$$DFFITS_i = \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2}\frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}} = \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2}e_i^* \qquad (8.22)$$

where $\sigma$ has been estimated by $s_{(i)}$ and $e_i^*$ denotes the externally studentized residual given in (8.3). Values of DFFITS larger than 2 in absolute value are considered influential. A size-adjusted cutoff for DFFITS suggested by Belsley, Kuh and Welsch (1980) is $2\sqrt{k/n}$.

In (8.3), the studentized residual $e_i^*$ was interpreted as a t-statistic that tests for the significance of the coefficient $\varphi$ of $d_i$, the dummy variable which takes the value 1 for the $i$-th observation and 0 otherwise, in the regression of $y$ on $X$ and $d_i$. This can now be easily proved as follows:

Consider the Chow test for the significance of $\varphi$. The RRSS $= (n-k)s^2$, the URSS $= (n-k-1)s^2_{(i)}$ and the Chow F-test described in (4.17) becomes

$$F_{1,n-k-1} = \frac{[(n-k)s^2 - (n-k-1)s^2_{(i)}]/1}{(n-k-1)s^2_{(i)}/(n-k-1)} = \frac{e_i^2}{s^2_{(i)}(1 - h_{ii})} \qquad (8.23)$$

The square root of (8.23) is $e_i^* \sim t_{n-k-1}$. These studentized residuals provide a better way to examine the information in the residuals, but they do not tell the whole story, since some of the most influential data points can have small $e_i^*$ (and very small $e_i$).

One overall measure of the impact of the $i$-th observation on the estimated regression coefficients is Cook's (1977) distance measure $D_i^2$. Recall that the confidence region for all $k$ regression coefficients is $(\hat\beta - \beta)'X'X(\hat\beta - \beta)/ks^2 \sim F(k, n-k)$. Cook's (1977) distance measure $D_i^2$ uses the same structure for measuring the combined impact of the differences in the estimated regression coefficients when the $i$-th observation is deleted:

$$D_i^2(s) = (\hat\beta - \hat\beta_{(i)})'X'X(\hat\beta - \hat\beta_{(i)})/ks^2 \qquad (8.24)$$

Even though $D_i^2(s)$ does not follow the above F-distribution, Cook suggests computing the percentile value from this F-distribution and declaring an influential observation if this percentile value $\geq 50\%$. In this case, the distance between $\hat\beta$ and $\hat\beta_{(i)}$ will be large, implying that the $i$-th observation has a substantial influence on the fit of the regression. Cook's distance measure can be equivalently computed as:

$$D_i^2(s) = \frac{e_i^2}{ks^2}\left(\frac{h_{ii}}{(1 - h_{ii})^2}\right) \qquad (8.25)$$

$D_i^2(s)$ depends on $e_i$ and $h_{ii}$; the larger $e_i$ or $h_{ii}$, the larger is $D_i^2(s)$. Note the relationship between Cook's $D_i^2(s)$ and Belsley, Kuh, and Welsch (1980) $DFFITS_i(\sigma)$ in (8.22), i.e.,

$$DFFITS_i(\sigma) = \sqrt{k}\,D_i(\sigma) = (\hat{y}_i - x_i'\hat\beta_{(i)})/(\sigma\sqrt{h_{ii}})$$

Belsley, Kuh, and Welsch (1980) suggest nominating DFFITS based on $s_{(i)}$ exceeding $2\sqrt{k/n}$ for special attention. Cook's 50 percentile recommendation is equivalent to DFFITS $> \sqrt{k}$, which is more conservative, see Velleman and Welsch (1981).

Next, we study the influence of the $i$-th observation deletion on the covariance matrix of the regression coefficients. One can compare the two covariance matrices using the ratio of their determinants:

$$COVRATIO_i = \frac{\det(s^2_{(i)}[X_{(i)}'X_{(i)}]^{-1})}{\det(s^2[X'X]^{-1})} = \frac{s^{2k}_{(i)}}{s^{2k}}\left(\frac{\det[X_{(i)}'X_{(i)}]^{-1}}{\det[X'X]^{-1}}\right) \qquad (8.26)$$

Using the fact that

$$\det[X_{(i)}'X_{(i)}] = (1 - h_{ii})\det[X'X] \qquad (8.27)$$

see problem 8, one obtains

$$COVRATIO_i = \left(\frac{s^2_{(i)}}{s^2}\right)^k \times \frac{1}{1 - h_{ii}} = \frac{1}{\left(\dfrac{n-k-1}{n-k} + \dfrac{e_i^{*2}}{n-k}\right)^k(1 - h_{ii})} \qquad (8.28)$$

where the last equality follows from (8.18) and the definition of $e_i^*$ in (8.3). Values of COVRATIO not near unity identify possible influential observations and warrant further investigation. Belsley, Kuh and Welsch (1980) suggest investigating points with $|COVRATIO - 1|$ near to or larger than $3k/n$. The COVRATIO depends upon both $h_{ii}$ and $e_i^{*2}$. In fact, from (8.28), COVRATIO is large when $h_{ii}$ is large and small when $e_i^*$ is large. The two factors can offset each other, that is why it is important to look at $h_{ii}$ and $e_i^*$ separately as well as in combination as in COVRATIO.

Finally, one can look at how the variance of $\hat{y}_i$ changes when an observation is deleted. Here $\text{var}(\hat{y}_i) = s^2 h_{ii}$ and $\text{var}(\hat{y}_{(i)}) = \text{var}(x_i'\hat\beta_{(i)}) = s^2_{(i)}(h_{ii}/(1 - h_{ii}))$, and the ratio is

$$FVARATIO_i = s^2_{(i)}/s^2(1 - h_{ii}) \qquad (8.29)$$

This expression is similar to COVRATIO except that $[s^2_{(i)}/s^2]$ is not raised to the $k$-th power. As a diagnostic measure it will exhibit the same patterns of behavior with respect to different configurations of $h_{ii}$ and the studentized residual as described for COVRATIO.
Table 8.1 Cigarette Regression

Dependent Variable: LNC

Analysis of Variance

                      Sum of        Mean
Source       DF      Squares      Square     F Value     Prob>F
Model         2      0.50098     0.25049       9.378     0.0004
Error        43      1.14854     0.02671
C Total      45      1.64953

Root MSE     0.16343     R-square     0.3037
Dep Mean     4.84784     Adj R-sq     0.2713
C.V.         3.37125

Parameter Estimates

                     Parameter       Standard      T for H0:
Variable     DF       Estimate          Error    Parameter=0    Prob > |T|
INTERCEP      1       4.299662     0.90892571          4.730        0.0001
LNP           1      -1.338335     0.32460147         -4.123        0.0002
LNY           1       0.172386     0.19675440          0.876        0.3858


Example 1: For the cigarette data given in Table 3.2, Table 8.1 gives the SAS least squares regression for logC on logP and logY.

$$\log C = \underset{(0.909)}{4.30} - \underset{(0.325)}{1.34}\,\log P + \underset{(0.197)}{0.172}\,\log Y + \text{residuals}$$

The standard error of the regression is $s = 0.16343$ and $\bar{R}^2 = 0.271$. Table 8.2 gives the data along with the predicted values of logC, the least squares residuals $e$, the internal studentized residuals $\tilde{e}$ given in (8.1), the externally studentized residuals $e^*$ given in (8.3), the Cook statistic given in (8.25), the leverage of each observation $h$, the DFFITS given in (8.22) and the COVRATIO given in (8.28).

Using the leverage column, one can identify four potential observations with high leverage, i.e., greater than $2\bar{h} = 2k/n = 6/46 = 0.13043$. These are the observations belonging to the following states: Connecticut (CT), Kentucky (KY), New Hampshire (NH) and New Jersey (NJ) with leverage 0.13535, 0.19775, 0.13081 and 0.13945, respectively. Note that the corresponding OLS residuals are $-0.078$, 0.234, 0.160 and $-0.059$, which are not necessarily large. The internally studentized residuals are computed using equation (8.1). For KY this gives

$$\tilde{e}_{KY} = \frac{e_{KY}}{s\sqrt{1 - h_{KY}}} = \frac{0.23428}{0.16343\sqrt{1 - 0.19775}} = 1.6005$$

From Table 8.2, two observations with high internally studentized residuals are those belonging to Arkansas (AR) and Utah (UT) with values of 2.102 and $-2.679$ respectively, both larger than 2 in absolute value.

The externally studentized residuals are computed from (8.3). For KY, we first compute $s^2_{(KY)}$, the MSE from the regression computed without the KY observation. From (8.2), this is given by

$$s^2_{(KY)} = \frac{(n-k)s^2 - e_{KY}^2/(1 - h_{KY})}{n-k-1} = \frac{(46-3)(0.16343)^2 - (0.23428)^2/(1 - 0.19775)}{46-3-1} = 0.025716$$

From (8.3) we get

$$e^*_{KY} = \frac{e_{KY}}{s_{(KY)}\sqrt{1 - h_{KY}}} = \frac{0.23428}{0.16036\sqrt{1 - 0.19775}} = 1.6311$$


This externally studentized residual is distributed as a t-statistic with 42 degrees of freedom. However, $e^*_{KY}$ does not exceed 2 in absolute value. Again, $e^*_{AR}$ and $e^*_{UT}$ are 2.193 and $-2.901$, both larger than 2 in absolute value. From (8.13), the change in the regression coefficients due to the omission of the KY observation is given by

$$\hat\beta - \hat\beta_{(KY)} = (X'X)^{-1}x_{KY}e_{KY}/(1 - h_{KY})$$

Using the fact that

$$(X'X)^{-1} = \begin{bmatrix} 30.929816904 & 4.8110214655 & -6.679318415 \\ 4.8110214655 & 3.9447686638 & -1.177208398 \\ -6.679318415 & -1.177208398 & 1.4493372835 \end{bmatrix}$$

and $x'_{KY} = (1, -0.03260, 4.64937)$ with $e_{KY} = 0.23428$ and $h_{KY} = 0.19775$ one gets

$$(\hat\beta - \hat\beta_{(KY)})' = (-0.082249, -0.230954, 0.028492)$$
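The Kentucky figures can be reproduced from (8.13) using only the numbers printed in this example. The sketch below is a hedged re-computation, not the author's code: the variable names are hypothetical, the off-diagonal (1,2) entry is taken as 4.8110214655 on the assumption that the matrix is symmetric, and because the inputs are rounded the results agree with the reported values only to about five decimal places.

```python
import math

# (X'X)^{-1} as reported in the text (assumed symmetric).
xtx_inv = [
    [30.929816904, 4.8110214655, -6.679318415],
    [4.8110214655, 3.9447686638, -1.177208398],
    [-6.679318415, -1.177208398, 1.4493372835],
]
x_ky = [1.0, -0.03260, 4.64937]    # row of X for Kentucky
e_ky = 0.23428                     # OLS residual
h_ky = 0.19775                     # leverage
s_deleted = 0.16036                # s_(KY), root MSE without Kentucky

# (X'X)^{-1} x_KY
direction = [sum(row[j] * x_ky[j] for j in range(3)) for row in xtx_inv]

# Leverage recomputed as x'(X'X)^{-1}x, as a consistency check on the inputs.
leverage_check = sum(x_ky[j] * direction[j] for j in range(3))

# Equation (8.13): change in the coefficients when Kentucky is dropped.
beta_change = [d * e_ky / (1 - h_ky) for d in direction]

# Equation (8.19): DFBETAS for each coefficient.
dfbetas = [beta_change[j] / (s_deleted * math.sqrt(xtx_inv[j][j])) for j in range(3)]

# Equation (8.21): change in Kentucky's fitted value.
dffit = sum(x_ky[j] * beta_change[j] for j in range(3))
```

Note that the first element of `direction` is a small difference between two numbers near 31, so rounding in the inputs is magnified; this is why only agreement to a few decimals should be expected.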

In order to assess whether this change is large or small, we compute DFBETAS given in (8.19). For the KY observation, these are given by

$$DFBETAS_{KY,1} = \frac{\hat\beta_1 - \hat\beta_{1(KY)}}{s_{(KY)}\sqrt{(X'X)^{-1}_{11}}} = \frac{-0.082249}{0.16036\sqrt{30.9298169}} = -0.09222$$

Similarly, $DFBETAS_{KY,2} = -0.7251$ and $DFBETAS_{KY,3} = 0.14758$. These are not larger than 2 in absolute value. However, $DFBETAS_{KY,2}$ is larger than $2/\sqrt{n} = 2/\sqrt{46} = 0.2949$ in absolute value. This is the size-adjusted cutoff recommended by Belsley, Kuh and Welsch (1980) for large $n$.

The change in the fit due to the omission of the KY observation is given by (8.21). In fact,

$$DFFIT_{KY} = \hat{y}_{KY} - \hat{y}_{(KY)} = x'_{KY}[\hat\beta - \hat\beta_{(KY)}] = (1, -0.03260, 4.64937)\begin{pmatrix} -0.082249 \\ -0.230954 \\ 0.028492 \end{pmatrix} = 0.05775$$

or simply

$$DFFIT_{KY} = \frac{h_{KY}e_{KY}}{1 - h_{KY}} = \frac{(0.19775)(0.23428)}{1 - 0.19775} = 0.05775$$
Scaling it by the standard deviation of $\hat{y}_{KY}$ we get from (8.22)

$$DFFITS_{KY} = \left(\frac{h_{KY}}{1 - h_{KY}}\right)^{1/2}e^*_{KY} = \left(\frac{0.19775}{1 - 0.19775}\right)^{1/2}(1.6311) = 0.8098$$

This is not larger than 2 in absolute value, but it is larger than the size-adjusted cutoff of $2\sqrt{k/n} = 2\sqrt{3/46} = 0.511$. Note also that both $DFFITS_{AR} = 0.667$ and $DFFITS_{UT} = -0.888$ are larger than 0.511 in absolute value.

Cook’s distance measure is given in (8.25) and for KY can be computed as

$$D^2_{KY}(s) = \frac{e^2_{KY}}{k s^2}\left(\frac{h_{KY}}{(1 - h_{KY})^2}\right) = \left(\frac{(0.23428)^2}{3(0.16343)^2}\right)\left(\frac{0.19775}{(1 - 0.19775)^2}\right) = 0.21046$$

The other two large Cook’s distance measures are $D^2_{AR}(s) = 0.13623$ and $D^2_{UT}(s) = 0.22399$, respectively. COVRATIO omitting the KY observation can be computed from (8.28) as

$$\text{COVRATIO}_{KY} = \left(\frac{s^2_{(KY)}}{s^2}\right)^{k} \frac{1}{1 - h_{KY}} = \left(\frac{0.025716}{(0.16343)^2}\right)^{3} \frac{1}{(1 - 0.19775)} = 1.1125$$

which means that $|\text{COVRATIO}_{KY} - 1| = 0.1125$ is less than $3k/n = 9/46 = 0.1956$.
Finally, FVARATIO omitting the KY observation can be computed from (8.29) as

$$\text{FVARATIO}_{KY} = \frac{s^2_{(KY)}}{s^2(1 - h_{KY})} = \frac{0.025716}{(0.16343)^2(1 - 0.19775)} = 1.2001$$


By several diagnostic measures, AR, KY and UT are influential observations that deserve special attention. The first two states are characterized by large sales of cigarettes. KY is a producer state with a very low price on cigarettes, while UT is a low consumption state due to its high percentage of Mormon population (a religion that forbids smoking). Table 8.3 gives the predicted consumption along with the 95% confidence band, the OLS residuals, the internally studentized residuals, Cook’s D-statistic and a plot of these residuals. This last plot highlights the fact that AR, UT and KY have large studentized residuals.
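The KY calculations in this section can be retraced numerically. The following is a minimal sketch in plain Python that plugs the numbers quoted in the text into (8.13), (8.19), (8.21), (8.22), (8.25), (8.28) and (8.29). The variable names are illustrative choices, not notation from the book, and the inputs are rounded as printed, so the results agree with the text only up to rounding.

```python
import math

# (X'X)^{-1} for the cigarette regression, as quoted in the text (symmetric).
XTX_INV = [
    [30.929816904, 4.8110214655, -6.679318415],
    [4.8110214655, 3.9447686638, -1.177208398],
    [-6.679318415, -1.177208398, 1.4493372835],
]
x_ky = [1.0, -0.03260, 4.64937]      # constant, LNP, LNY for KY
e_ky, h_ky = 0.23428, 0.19775        # OLS residual and leverage of KY
e_star_ky = 1.6311                   # externally studentized residual of KY
s, s_ky, s2_ky = 0.16343, 0.16036, 0.025716   # s, s_(KY), s_(KY)^2
n, k = 46, 3

# (8.13): change in each coefficient when KY is omitted.
scale = e_ky / (1.0 - h_ky)
dfbeta = [scale * sum(a * b for a, b in zip(row, x_ky)) for row in XTX_INV]
# (8.19): scale by s_(KY) times the square root of the matching diagonal element.
dfbetas = [d / (s_ky * math.sqrt(XTX_INV[j][j])) for j, d in enumerate(dfbeta)]
# (8.21) and (8.22): change in fit and its scaled version.
dffit = h_ky * e_ky / (1.0 - h_ky)
dffits = math.sqrt(h_ky / (1.0 - h_ky)) * e_star_ky
# (8.25): Cook's distance.
cook_d = (e_ky**2 / (k * s**2)) * (h_ky / (1.0 - h_ky) ** 2)
# (8.28) and (8.29).
covratio = (s2_ky / s**2) ** k / (1.0 - h_ky)
fvaratio = s2_ky / (s**2 * (1.0 - h_ky))

# Size-adjusted cutoffs used in the text.
cutoffs = {
    "DFBETAS": 2 / math.sqrt(n),
    "DFFITS": 2 * math.sqrt(k / n),
    "COVRATIO": 3 * k / n,
}
```

Comparing `abs(dfbetas[1])` with `cutoffs["DFBETAS"]` and `dffits` with `cutoffs["DFFITS"]` reproduces the conclusion that KY exceeds the size-adjusted cutoffs but not the absolute cutoff of 2.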


8.2  Recursive  Residuals 

In Section 8.1, we showed that the least squares residuals are heteroskedastic with non-zero covariances, even when the true disturbances have a scalar covariance matrix. This section studies recursive residuals which are a set of linear unbiased residuals with a scalar covariance matrix. They are independent and identically distributed when the true disturbances themselves are independent and identically distributed.² These residuals are natural in time-series regressions and can be constructed as follows:

1. Choose the first $t \geq k$ observations and compute $\hat\beta_t = (X_t'X_t)^{-1}X_t'Y_t$ where $X_t$ denotes the t x k matrix of t observations on k variables and $Y_t' = (y_1, \ldots, y_t)$. The recursive residuals are basically standardized one-step ahead forecast residuals:

$$w_{t+1} = (y_{t+1} - x'_{t+1}\hat\beta_t)\Big/\sqrt{1 + x'_{t+1}(X_t'X_t)^{-1}x_{t+1}} \qquad (8.30)$$
Table 8.2 Diagnostic Statistics for the Cigarettes Example


OBS  STATE  LNC      LNP       LNY      PREDICTED  e        ẽ       e*       Cook’s D  Leverage  DFFITS   COVRATIO
1    AL     4.96213   0.20487  4.64039  4.8254      0.1367   0.857   0.8546  0.012     0.0480     0.1919  1.0704
2    AZ     4.66312   0.16640  4.68389  4.8844     -0.2213  -1.376  -1.3906  0.021     0.0315    -0.2508  0.9681
3    AR     5.10709   0.23406  4.59435  4.7784      0.3287   2.102   2.1932  0.136     0.0847     0.6670  0.8469
4    CA     4.50449   0.36399  4.88147  4.6540     -0.1495  -0.963  -0.9623  0.033     0.0975    -0.3164  1.1138
5    CT     4.66983   0.32149  5.09472  4.7477     -0.0778  -0.512  -0.5077  0.014     0.1354    -0.2009  1.2186
6    DE     5.04705   0.21929  4.87087  4.8458      0.2012   1.252   1.2602  0.018     0.0326     0.2313  0.9924
7    DC     4.65637   0.28946  5.05960  4.7845     -0.1281  -0.831  -0.8280  0.029     0.1104    -0.2917  1.1491
8    FL     4.80081   0.28733  4.81155  4.7446      0.0562   0.352   0.3482  0.002     0.0431     0.0739  1.1118
9    GA     4.97974   0.12826  4.73299  4.9439      0.0358   0.224   0.2213  0.001     0.0402     0.0453  1.1142
10   ID     4.74902   0.17541  4.64307  4.8653     -0.1163  -0.727  -0.7226  0.008     0.0413    -0.1500  1.0787
11   IL     4.81445   0.24806  4.90387  4.8130      0.0014   0.009   0.0087  0.000     0.0399     0.0018  1.1178
12   IN     5.11129   0.08992  4.72916  4.9946      0.1167   0.739   0.7347  0.013     0.0650     0.1936  1.1046
13   IA     4.80857   0.24081  4.74211  4.7949      0.0137   0.085   0.0843  0.000     0.0310     0.0151  1.1070
14   KS     4.79263   0.21642  4.79613  4.8368     -0.0442  -0.273  -0.2704  0.001     0.0223    -0.0408  1.0919
15   KY     5.37906  -0.03260  4.64937  5.1448      0.2343   1.600   1.6311  0.210     0.1977     0.8098  1.1126
16   LA     4.98602   0.23856  4.61461  4.7759      0.2101   1.338   1.3504  0.049     0.0761     0.3875  1.0224
17   ME     4.98722   0.29106  4.75501  4.7298      0.2574   1.620   1.6527  0.051     0.0553     0.4000  0.9403
18   MD     4.77751   0.12575  4.94692  4.9841     -0.2066  -1.349  -1.3624  0.084     0.1216    -0.5070  1.0731
19   MA     4.73877   0.22613  4.99998  4.8590     -0.1202  -0.769  -0.7653  0.018     0.0856    -0.2341  1.1258
20   MI     4.94744   0.23067  4.80620  4.8195      0.1280   0.792   0.7890  0.005     0.0238     0.1232  1.0518
21   MN     4.69589   0.34297  4.81207  4.6702      0.0257   0.165   0.1627  0.001     0.0864     0.0500  1.1724
22   MS     4.93990   0.13638  4.52938  4.8979      0.0420   0.269   0.2660  0.002     0.0883     0.0828  1.1712
23   MO     5.06430   0.08731  4.78189  5.0071      0.0572   0.364   0.3607  0.004     0.0787     0.1054  1.1541
24   MT     4.73313   0.15303  4.70417  4.9058     -0.1727  -1.073  -1.0753  0.012     0.0312    -0.1928  1.0210
25   NE     4.77558   0.18907  4.79671  4.8735     -0.0979  -0.607  -0.6021  0.003     0.0243    -0.0950  1.0719
26   NV     4.96642   0.32304  4.83816  4.7014      0.2651   1.677   1.7143  0.065     0.0646     0.4504  0.9366
27   NH     5.10990   0.15852  5.00319  4.9500      0.1599   1.050   1.0508  0.055     0.1308     0.4076  1.1422
28   NJ     4.70633   0.30901  5.10268  4.7657     -0.0594  -0.392  -0.3879  0.008     0.1394    -0.1562  1.2337
29   NM     4.58107   0.16458  4.58202  4.8693     -0.2882  -1.823  -1.8752  0.076     0.0639    -0.4901  0.9007
30   NY     4.66496   0.34701  4.96075  4.6904     -0.0254  -0.163  -0.1613  0.001     0.0888    -0.0503  1.1755
31   ND     4.58237   0.18197  4.69163  4.8649     -0.2825  -1.755  -1.7999  0.031     0.0295    -0.3136  0.8848
32   OH     4.97952   0.12889  4.75875  4.9475      0.0320   0.200   0.1979  0.001     0.0423     0.0416  1.1174
33   OK     4.72720   0.19554  4.62730  4.8356     -0.1084  -0.681  -0.6766  0.008     0.0505    -0.1560  1.0940
34   PA     4.80363   0.22784  4.83516  4.8282     -0.0246  -0.153  -0.1509  0.000     0.0257    -0.0245  1.0997
35   RI     4.84693   0.30324  4.84670  4.7293      0.1176   0.738   0.7344  0.010     0.0504     0.1692  1.0876
36   SC     5.07801   0.07944  4.62549  4.9907      0.0873   0.555   0.5501  0.008     0.0725     0.1538  1.1324
37   SD     4.81545   0.13139  4.67747  4.9301     -0.1147  -0.716  -0.7122  0.007     0.0402    -0.1458  1.0786
38   TN     5.04939   0.15547  4.72525  4.9062      0.1432   0.890   0.8874  0.008     0.0294     0.1543  1.0457
39   TX     4.65398   0.28196  4.73437  4.7384     -0.0845  -0.532  -0.5271  0.005     0.0546    -0.1267  1.1129
40   UT     4.40859   0.19260  4.55586  4.8273     -0.4187  -2.679  -2.9008  0.224     0.0856    -0.8876  0.6786
41   VT     5.08799   0.18018  4.77578  4.8818      0.2062   1.277   1.2869  0.014     0.0243     0.2031  0.9794
42   VA     4.93065   0.11818  4.85490  4.9784     -0.0478  -0.304  -0.3010  0.003     0.0773    -0.0871  1.1556
43   WA     4.66134   0.35053  4.85645  4.6677     -0.0064  -0.041  -0.0404  0.000     0.0866    -0.0124  1.1747
44   WV     4.82454   0.12008  4.56859  4.9265     -0.1020  -0.647  -0.6429  0.011     0.0709    -0.1777  1.1216
45   WI     4.83026   0.22954  4.75826  4.8127      0.0175   0.109   0.1075  0.000     0.0254     0.0174  1.1002
46   WY     5.00087   0.10029  4.71169  4.9777      0.0232   0.146   0.1444  0.000     0.0555     0.0350  1.1345

2. Add the (t+1)-th observation to the data and obtain $\hat\beta_{t+1} = (X'_{t+1}X_{t+1})^{-1}X'_{t+1}Y_{t+1}$. Compute $w_{t+2}$.


3. Repeat step 2, adding one observation at a time. In time-series regressions, one usually starts with the first k observations and obtains (T - k) forward recursive residuals. These recursive residuals can be computed using the updating formula given in (8.11) with $A = (X_t'X_t)$ and $a = -b = x'_{t+1}$. Therefore,

$$(X'_{t+1}X_{t+1})^{-1} = (X_t'X_t)^{-1} - \frac{(X_t'X_t)^{-1}x_{t+1}x'_{t+1}(X_t'X_t)^{-1}}{1 + x'_{t+1}(X_t'X_t)^{-1}x_{t+1}} \qquad (8.31)$$

and only $(X_t'X_t)^{-1}$ has to be computed. Also,

$$\hat\beta_{t+1} = \hat\beta_t + (X_t'X_t)^{-1}x_{t+1}(y_{t+1} - x'_{t+1}\hat\beta_t)/f_{t+1} \qquad (8.32)$$

where $f_{t+1} = 1 + x'_{t+1}(X_t'X_t)^{-1}x_{t+1}$, see problem 13.
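The recursion in (8.30) to (8.32) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than a reference implementation: the function name `recursive_residuals`, the simulated design matrix and the coefficient values are all hypothetical, and the first k rows of X are assumed to form a nonsingular matrix so that the recursion can start at t = k.

```python
import numpy as np

def recursive_residuals(X, y):
    """Return the T - k recursive residuals w_{k+1}, ..., w_T of (8.30)."""
    T, k = X.shape
    A_inv = np.linalg.inv(X[:k].T @ X[:k])     # (X_k'X_k)^{-1}, the only explicit inverse
    beta = A_inv @ X[:k].T @ y[:k]             # beta_k: perfect fit on k observations
    w = np.empty(T - k)
    for t in range(k, T):                      # row t (0-based) is observation t + 1
        x_new = X[t]
        Ax = A_inv @ x_new
        f = 1.0 + x_new @ Ax                   # f_{t+1}
        err = y[t] - x_new @ beta              # one-step-ahead forecast error
        w[t - k] = err / np.sqrt(f)            # (8.30)
        beta = beta + Ax * err / f             # (8.32)
        A_inv = A_inv - np.outer(Ax, Ax) / f   # (8.31); A_inv is symmetric
    return w

# Hypothetical data: a constant and two smooth regressors, plus simulated noise.
T, k = 30, 3
idx = np.arange(T)
X = np.column_stack([np.ones(T), np.sin(idx), np.cos(2 * idx)])
rng = np.random.default_rng(0)
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T)

w = recursive_residuals(X, y)
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_full                          # ordinary least squares residuals
```

In this sketch `w @ w` and `e @ e` agree up to rounding, which is the sum of squares identity derived later in this section.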

Alternatively, one can compute these residuals by regressing $Y_{t+1}$ on $X_{t+1}$ and $d_{t+1}$ where $d_{t+1} = 1$ for the (t+1)-th observation, and zero otherwise, see equation (8.5). The estimated coefficient of $d_{t+1}$ is the numerator of $w_{t+1}$. The standard error of this estimate is $s_{t+1}$ times the denominator of $w_{t+1}$, where $s_{t+1}$ is the standard error of this regression. Hence, $w_{t+1}$ can be retrieved as $s_{t+1}$ multiplied by the t-statistic corresponding to $d_{t+1}$. This computation has to be performed sequentially, in each case generating the corresponding recursive residual. This may be computationally inefficient, but it is simple to generate using regression packages.
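The dummy-variable route just described can be sketched as follows. The helper name `recursive_residual_by_dummy` and the simulated data are hypothetical. The sketch assumes t > k, so that the augmented regression has positive degrees of freedom and $s_{t+1}$ is defined.

```python
import numpy as np

def recursive_residual_by_dummy(X, y, t):
    """w_{t+1} from regressing the first t + 1 observations on X and a dummy d_{t+1}."""
    Xa = np.column_stack([X[: t + 1], np.zeros(t + 1)])
    Xa[t, -1] = 1.0                                  # d_{t+1} = 1 for observation t + 1 only
    ya = y[: t + 1]
    coef, *_ = np.linalg.lstsq(Xa, ya, rcond=None)
    resid = ya - Xa @ coef
    dof = (t + 1) - Xa.shape[1]                      # equals t - k
    s = np.sqrt(resid @ resid / dof)                 # s_{t+1}, standard error of the regression
    se_dummy = s * np.sqrt(np.linalg.inv(Xa.T @ Xa)[-1, -1])
    t_stat = coef[-1] / se_dummy
    return s * t_stat                                # w_{t+1} = s_{t+1} times the t-statistic

# Hypothetical data for illustration.
T = 20
idx = np.arange(T)
X = np.column_stack([np.ones(T), np.sin(idx), np.cos(2 * idx)])
rng = np.random.default_rng(0)
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T)

w_11 = recursive_residual_by_dummy(X, y, 10)         # recursive residual for observation 11
```

One regression is needed per residual, which is why the text calls this route computationally inefficient compared with the updating formulas.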

It is obvious from (8.30) that if $u_t \sim \text{IIN}(0, \sigma^2)$, then $w_{t+1}$ has zero mean and $\text{var}(w_{t+1}) = \sigma^2$. Furthermore, $w_{t+1}$ is linear in the y's. Therefore, it is normally distributed. It remains to show that the recursive residuals are independent. Given normality, it is sufficient to show that

$$\text{cov}(w_{t+1}, w_{s+1}) = 0 \quad \text{for } t \neq s;\ t, s = k, \ldots, T-1 \qquad (8.33)$$


This  is  left  as  an  exercise  for  the  reader,  see  problem  13. 

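Alternatively, one can express the T - k vector of recursive residuals as w = Cy where C is of dimension (T - k) x T as follows:

$$C = \begin{bmatrix}
-\dfrac{x'_{k+1}(X_k'X_k)^{-1}X_k'}{\sqrt{f_{k+1}}} & \dfrac{1}{\sqrt{f_{k+1}}} & 0 \ldots 0 \\
\vdots & & \\
-\dfrac{x'_{t}(X_{t-1}'X_{t-1})^{-1}X_{t-1}'}{\sqrt{f_{t}}} & \dfrac{1}{\sqrt{f_{t}}} & 0 \ldots 0 \\
\vdots & & \\
-\dfrac{x'_{T}(X_{T-1}'X_{T-1})^{-1}X_{T-1}'}{\sqrt{f_{T}}} & \dfrac{1}{\sqrt{f_{T}}} &
\end{bmatrix} \qquad (8.34)$$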


Problem 14 asks the reader to verify that w = Cy, using (8.30). Also, that the matrix C satisfies the following properties:

$$\text{(i) } CX = 0 \qquad \text{(ii) } CC' = I_{T-k} \qquad \text{(iii) } C'C = \bar P_X \qquad (8.35)$$

This means that the recursive residuals w are (LUS) linear in y, unbiased with mean zero and have a scalar variance-covariance matrix: $\text{var}(w) = CE(uu')C' = \sigma^2 I_{T-k}$. Property (iii) also means that $w'w = y'C'Cy = y'\bar P_X y = e'e$. This means that the sum of squares of (T - k) recursive residuals is equal to the sum of squares of T least squares residuals. One can also show from (8.32) that

$$RSS_{t+1} = RSS_t + w^2_{t+1} \quad \text{for } t = k, \ldots, T-1 \qquad (8.36)$$

where $RSS_t = (Y_t - X_t\hat\beta_t)'(Y_t - X_t\hat\beta_t)$, see problem 14. Note that for t = k; RSS = 0, since with k observations one gets a perfect fit and zero residuals. Therefore

$$RSS_T = \sum_{t=k+1}^{T} w_t^2 = \sum_{t=1}^{T} e_t^2 \qquad (8.37)$$
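The three properties in (8.35) can be checked numerically on a small design. The sketch below builds C row by row from (8.34). The function name `recursive_residual_matrix` and the design matrix are hypothetical, and each leading block of X is assumed to have full column rank.

```python
import numpy as np

def recursive_residual_matrix(X):
    """Build the (T - k) x T matrix C of (8.34), so that w = C y."""
    T, k = X.shape
    C = np.zeros((T - k, T))
    for t in range(k, T):                      # row producing w_{t+1} from observations 1..t
        Xt, x_new = X[:t], X[t]
        A_inv = np.linalg.inv(Xt.T @ Xt)
        f = 1.0 + x_new @ A_inv @ x_new
        C[t - k, :t] = -(x_new @ A_inv @ Xt.T) / np.sqrt(f)
        C[t - k, t] = 1.0 / np.sqrt(f)
    return C

# Hypothetical design with a constant and two regressors.
T, k = 12, 3
idx = np.arange(T)
X = np.column_stack([np.ones(T), np.sin(idx), np.cos(2 * idx)])

C = recursive_residual_matrix(X)
P_bar = np.eye(T) - X @ np.linalg.inv(X.T @ X) @ X.T   # residual-maker matrix

prop_i = np.allclose(C @ X, 0.0, atol=1e-8)            # CX = 0
prop_ii = np.allclose(C @ C.T, np.eye(T - k), atol=1e-8)
prop_iii = np.allclose(C.T @ C, P_bar, atol=1e-8)
```

Because the rows of C are orthonormal and orthogonal to the columns of X, `C @ y` has the same sum of squares as the OLS residuals for any y, which is the content of (8.37).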


Applications  of  Recursive  Residuals 

Recursive  residuals  have  been  used  in  several  important  applications: 


(1) Harvey (1976) used these recursive residuals to give an alternative proof of the fact that Chow’s post-sample predictive test has an F-distribution. Recall, from Chapter 7, that when the second sample $n_2$ had fewer than k observations, Chow’s test becomes

$$F = \frac{(e'e - e_1'e_1)/n_2}{e_1'e_1/(n_1 - k)} \sim F(n_2, n_1 - k) \qquad (8.38)$$

where $e'e$ = RSS from the total sample ($n_1 + n_2 = T$ observations), and $e_1'e_1$ = RSS from the first $n_1$ observations. Recursive residuals can be computed for $t = k+1, \ldots, n_1$, and continued on for the extra $n_2$ observations. From (8.36) we have

$$e'e = \sum_{t=k+1}^{n_1+n_2} w_t^2 \quad \text{and} \quad e_1'e_1 = \sum_{t=k+1}^{n_1} w_t^2 \qquad (8.39)$$

Therefore,

$$F = \frac{\sum_{t=n_1+1}^{n_1+n_2} w_t^2/n_2}{\sum_{t=k+1}^{n_1} w_t^2/(n_1 - k)} \qquad (8.40)$$

But the $w_t$'s are $\sim \text{IIN}(0, \sigma^2)$ under the null, therefore the F-statistic in (8.38) is a ratio of two independent chi-squared variables, each divided by the appropriate degrees of freedom. Hence, $F \sim F(n_2, n_1 - k)$ under the null, see Chapter 2.


(2) Harvey and Phillips (1974) used recursive residuals to test the null hypothesis of homoskedasticity. If the alternative hypothesis is that $\sigma_i^2$ varies with $X_j$, the proposed test is as follows:

1) Order the data according to $X_j$ and choose a base of at least k observations from among the central observations.

2) From the first m observations compute the vector of recursive residuals $w_1$ using the base constructed in step 1. Also, compute the vector of recursive residuals $w_2$ from the last m observations. The maximum m can be is (T - k)/2.

3) Under the null hypothesis, it follows that

$$F = w_2'w_2/w_1'w_1 \sim F_{m,m} \qquad (8.41)$$

Harvey and Phillips suggest setting m at approximately (n/3) provided n > 3k. This test has the advantage over the Goldfeld-Quandt test in that if one wanted to test whether $\sigma_i^2$ varies with some other variable $X_s$, one could simply regroup the existing recursive residuals according to low and high values of $X_s$ and compute (8.41) afresh, whereas the Goldfeld-Quandt test would require the computation of two new regressions.


(3) Phillips and Harvey (1974) suggest using the recursive residuals to test the null hypothesis of no serial correlation using a modified von Neumann ratio:

$$MVNR = \frac{\sum_{t=k+2}^{T}(w_t - w_{t-1})^2/(T - k - 1)}{\sum_{t=k+1}^{T} w_t^2/(T - k)} \qquad (8.42)$$

This is the ratio of the mean-square successive difference to the variance. It is arithmetically closely related to the DW statistic, but given that $w \sim N(0, \sigma^2 I_{T-k})$ one has an exact test available and no inconclusive regions. Phillips and Harvey (1974) provide tabulations of the significance points. If the sample size is large, a satisfactory approximation is obtained from a normal distribution with mean 2 and variance 4/(T - k).
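As a small illustration of (8.42), the ratio and its large-sample normal approximation can be computed from a list of recursive residuals. The function names below are hypothetical, and the four-element input is a toy series chosen so the arithmetic is easy to follow by hand, not real data.

```python
import math

def modified_von_neumann_ratio(w):
    """MVNR of (8.42) for a sequence of n = T - k recursive residuals."""
    n = len(w)
    num = sum((w[i] - w[i - 1]) ** 2 for i in range(1, n)) / (n - 1)
    den = sum(v * v for v in w) / n
    return num / den

def mvnr_z(w):
    """Large-sample standardization: MVNR is roughly N(2, 4/(T - k))."""
    n = len(w)
    return (modified_von_neumann_ratio(w) - 2.0) / math.sqrt(4.0 / n)

toy = [1.0, -1.0, 1.0, -1.0]          # perfectly alternating signs
ratio = modified_von_neumann_ratio(toy)
z = mvnr_z(toy)
```

An alternating series pushes the ratio well above 2, the direction associated with negative serial correlation. With only four values the normal approximation is not meaningful, so `z` here only shows the arithmetic.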


(4) Harvey and Collier (1977) suggest a test for functional misspecification based on recursive residuals. This is based on the fact that $w \sim N(0, \sigma^2 I_{T-k})$. Therefore,

$$\bar w/(s_w/\sqrt{T - k}) \sim t_{T-k-1} \qquad (8.43)$$

where $\bar w = \sum_{t=k+1}^{T} w_t/(T - k)$ and $s_w^2 = \sum_{t=k+1}^{T}(w_t - \bar w)^2/(T - k - 1)$. Suppose that the true functional form relating y to a single explanatory variable X is concave (convex) and the data are ordered by X. A simple linear regression is estimated by regressing y on X. The recursive residuals would be expected to be mainly negative (positive) and the computed t-statistic will be large in absolute value. When there are multiple X's, one could carry out this test based on any single explanatory variable. Since several specification errors might have a self-cancelling effect on the recursive residuals, this test is not likely to be very effective in multivariate situations. Wu (1993) suggested performing this test using the following augmented regression:

$$y = X\beta + z\gamma + v \qquad (8.44)$$

where $z = C'\iota_{T-k}$ is one additional regressor with C defined in (8.34) and $\iota_{T-k}$ denoting a vector of ones of dimension T - k. In fact, the F-statistic for testing $H_0;\ \gamma = 0$ turns out to be the square of the Harvey and Collier (1977) t-statistic given in (8.43), see problem 15.

Alternatively, a Sign test may be used to test the null hypothesis of no functional misspecification. Under the null hypothesis, the expected number of positive recursive residuals is equal to (T - k)/2. A critical region may therefore be constructed from the binomial distribution. However, Harvey and Collier (1977) suggest that the Sign test tends to lack power compared with the t-test described in (8.43). Nevertheless, it is very simple and it may be more robust to non-normality.
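Both the t-statistic in (8.43) and the Sign test need only the vector of recursive residuals. A minimal sketch in plain Python follows. The function names are hypothetical, and the binomial tail shown is a one-sided probability under the assumption that each residual is positive with probability 1/2 under the null.

```python
import math

def harvey_collier_t(w):
    """t-statistic of (8.43): mean of the recursive residuals over its standard error."""
    n = len(w)                                   # n = T - k
    w_bar = sum(w) / n
    s2_w = sum((v - w_bar) ** 2 for v in w) / (n - 1)
    return w_bar / math.sqrt(s2_w / n)           # compare with t_{T-k-1}

def count_positive(w):
    return sum(1 for v in w if v > 0)

def sign_test_upper_tail(w):
    """P(at least as many positive residuals as observed) under Binomial(n, 1/2)."""
    n, m = len(w), count_positive(w)
    return sum(math.comb(n, j) for j in range(m, n + 1)) / 2**n

toy = [1.0, 2.0, 3.0]                            # toy residuals, all positive
t_stat = harvey_collier_t(toy)
p_upper = sign_test_upper_tail(toy)
```

For the toy input the mean is 2 and $s_w^2 = 1$, so the statistic equals $2\sqrt{3}$, and the tail probability of three positives out of three is 1/8.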
(5) Brown, Durbin and Evans (1975) used recursive residuals to test for structural change over time. The null hypothesis is


$$H_0;\quad \beta_1 = \beta_2 = \ldots = \beta_T = \beta \quad \text{and} \quad \sigma_1^2 = \sigma_2^2 = \ldots = \sigma_T^2 = \sigma^2 \qquad (8.45)$$

where $\beta_t$ is the vector of coefficients in period t and $\sigma_t^2$ is the disturbance variance for that period. The authors suggest a pair of tests. The first is the CUSUM test which computes

$$W_r = \sum_{t=k+1}^{r} w_t/s_w \quad \text{for } r = k+1, \ldots, T \qquad (8.46)$$

where $s_w^2$ is an estimate of the variance of the $w_t$'s, given below (8.43). $W_r$ is a cumulative sum and should be plotted against r. Under the null, $E(W_r) = 0$. But, if there is a structural break, $W_r$ will tend to diverge from the horizontal line. The authors suggest checking whether $W_r$ crosses a pair of straight lines (see Figure 8.1) which pass through the points $\{k, \pm a\sqrt{T - k}\}$ and $\{T, \pm 3a\sqrt{T - k}\}$ where a depends upon the chosen significance level $\alpha$. For example, a = 0.850, 0.948, and 1.143 for $\alpha$ = 10%, 5%, and 1% levels, respectively.

If the coefficients are not constant, there may be a tendency for a disproportionate number of recursive residuals to have the same sign and to push $W_r$ across the boundary. The second test is the cumulative sum of squares (CUSUMSQ) which is based on plotting

$$W_r^* = \sum_{t=k+1}^{r} w_t^2 \Big/ \sum_{t=k+1}^{T} w_t^2 \quad \text{for } r = k+1, \ldots, T \qquad (8.47)$$

against r. Under the null, $E(W_r^*) = (r - k)/(T - k)$ which varies from 0 for r = k to 1 for r = T. The significance of the departure of $W_r^*$ from its expected value is assessed by whether $W_r^*$ crosses a pair of lines parallel to $E(W_r^*)$ at a distance $c_0$ above and below this line. Brown, Durbin and Evans (1975) provide values of $c_0$ for various sample sizes T and levels of significance $\alpha$.

The CUSUM and CUSUMSQ should be regarded as data analytic techniques; i.e., the value of
the plots lies in the information to be gained simply by inspecting them. The plots contain more
information  than  can  be  summarized  in  a  single  test  statistic.  The  significance  lines  constructed 
are,  to  paraphrase  the  authors,  best  regarded  as  ‘yardsticks’  against  which  to  assess  the  observed 
plots  rather  than  as  formal  tests  of  significance.  See  Brown  et  al.  (1975)  for  various  examples. 
Note  that  the  CUSUM  and  CUSUMSQ  are  quite  general  tests  for  structural  change  in  that 
they  do  not  require  a  prior  determination  of  where  the  structural  break  takes  place.  If  this  is 
known,  the  Chow-test  will  be  more  powerful.  But,  if  this  break  is  not  known,  the  CUSUM  and 
CUSUMSQ  are  more  appropriate. 
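The two plotted quantities in (8.46) and (8.47), together with the CUSUM boundary lines, can be sketched in plain Python as below. The function names are hypothetical. The default a = 0.948 is the 5% value quoted in the text, and the CUSUMSQ distance $c_0$ is not computed because it has to be taken from the tables in Brown, Durbin and Evans (1975).

```python
import math

def cusum(w):
    """W_r of (8.46) for r = k+1, ..., T, from the n = T - k recursive residuals."""
    n = len(w)
    w_bar = sum(w) / n
    s_w = math.sqrt(sum((v - w_bar) ** 2 for v in w) / (n - 1))
    out, total = [], 0.0
    for v in w:
        total += v
        out.append(total / s_w)
    return out

def cusum_upper_bound(n, a=0.948):
    """Upper line through (k, a*sqrt(T-k)) and (T, 3a*sqrt(T-k)), at r = k+1, ..., T.
    The lower line is the negative of these values."""
    root = math.sqrt(n)
    return [a * root + 2.0 * a * root * j / n for j in range(1, n + 1)]

def cusumsq(w):
    """W*_r of (8.47): running share of the total sum of squares."""
    total = sum(v * v for v in w)
    out, run = [], 0.0
    for v in w:
        run += v * v
        out.append(run / total)
    return out

toy = [1.0, -1.0, 1.0, -1.0]          # toy residuals with no drift
W = cusum(toy)
W_star = cusumsq(toy)
upper = cusum_upper_bound(len(toy))
crossed = any(abs(v) > b for v, b in zip(W, upper))
```

With residuals that alternate in sign, the cumulative sum stays near zero and `crossed` is false, while a long run of same-signed residuals would carry `W` toward the boundary.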

Example  2:  Table  8.5  reproduces  the  consumption-income  data,  over  the  period  1959-2007, 
taken  from  the  Economic  Report  of  the  President.  In  addition,  the  recursive  residuals  are 
computed  as  in  (8.30)  and  exhibited  in  column  5,  starting  with  1961  and  ending  in  2007.  The 
CUSUM  given  by  Wr  in  (8.46)  is  plotted  against  r  in  Figure  8.2.  The  CUSUM  crosses  the 
upper  5%  line  in  1998,  showing  structural  instability  in  the  latter  years.  This  was  done  using 
EViews  6. 

The post-sample predictive test for 1998 can be obtained from (8.38) by computing the RSS from 1959-1997 and comparing it with the RSS from 1959-2007. The observed F-statistic is 5.748, which is distributed as F(10,37). Using EViews, one clicks on stability diagnostics and then selects Chow forecast test. You will be prompted to enter the break point period, which in this case is 1998. EViews gives the back up regression which is not shown here, and also performs a likelihood ratio test, see Table 8.4.

Figure 8.1 CUSUM Critical Values

Table 8.4 Chow Forecast Test

Specification: CONSUM C Y
Test predictions for observations from 1998 to 2007

                   Value       df       Probability
F-statistic        5.747855    (10,37)  0.0000
Likelihood ratio   45.93529    10       0.0000

F-test summary:
                   Sum of Sq.  df       Mean Squares
Test SSR           5476210.    10       547621.0
Restricted SSR     9001348.    47       191518.0
Unrestricted SSR   3525138.    37       95273.99

LR test summary:
                   Value       df
Restricted LogL    -366.4941   47
Unrestricted LogL  -343.5264   37

The reader can verify that the same F-statistic can be obtained from (8.40) using the recursive residuals in Table 8.5. In fact,

$$F = \left(\sum_{t=1998}^{2007} w_t^2/10\right)\Big/\left(\sum_{t=1961}^{1997} w_t^2/37\right) = 5.748$$
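This verification is easy to carry out directly. The sketch below copies the 47 recursive residuals from Table 8.5 and applies (8.40) with $n_2 = 10$ and $n_1 - k = 37$. The variable names are illustrative only.

```python
# Recursive residuals for 1961-2007, copied from Table 8.5.
w = [
    -30.06109, 53.63333, 57.07454, -14.42043, 40.23840, 72.59054, -58.72718,
    88.63871, 125.0883, -88.54736, -123.0740, 68.23355, -118.2972, -100.8288,
    -72.86693, 148.9270, 231.2810, 178.9840, 147.8067, -80.37207, -229.1660,
    -296.0910, 86.49899, -205.6594, 111.3357, 251.5306, 479.8759, 405.8181,
    366.8060, 347.8156, 243.0261, 195.0177, 588.3097, 731.2551, 660.7508,
    696.2055, 689.6197,                      # 1961-1997: 37 values
    474.3981, 870.8977, 751.2861, 808.6041, 639.0555, 700.0686, 633.1310,
    970.8717, 760.6385, 673.7335,            # 1998-2007: 10 values
]
years = list(range(1961, 2008))

n1_minus_k = sum(1 for yr in years if yr <= 1997)          # 37
n2 = len(w) - n1_minus_k                                   # 10
ss_first = sum(v**2 for v, yr in zip(w, years) if yr <= 1997)
ss_extra = sum(v**2 for v, yr in zip(w, years) if yr >= 1998)

F = (ss_extra / n2) / (ss_first / n1_minus_k)              # (8.40)
```

Here `ss_extra` and `ss_first` correspond, up to rounding, to the Test SSR and Unrestricted SSR rows of Table 8.4.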




Table 8.5 Recursive Residuals for the Consumption Regression

Year  CONSUM  Income  RESID      Recursive RES
1959  8776    9685     635.4909   NA
1960  8837    9735     647.5295   NA
1961  8873    9901     520.9776  -30.06109
1962  9170    10227    498.7493   53.63333
1963  9412    10455    517.4853   57.07454
1964  9839    11061    351.0732  -14.42043
1965  10331   11594    321.1447   40.23840
1966  10793   12065    321.9283   72.59054
1967  10994   12457    139.0709  -58.72718
1968  11510   12892    229.1068   88.63871
1969  11820   13163    273.7360   125.0883
1970  11955   13563    17.04481  -88.54736
1971  12256   14001   -110.8570  -123.0740
1972  12868   14512    0.757470   68.23355
1973  13371   15345   -311.9394  -118.2972
1974  13148   15094   -289.1532  -100.8288
1975  13320   15291   -310.0611  -72.86693
1976  13919   15738   -148.7760   148.9270
1977  14364   16128   -85.67493   231.2810
1978  14837   16704   -176.7102   178.9840
1979  15030   16931   -205.9950   147.8067
1980  14816   16940   -428.8080  -80.37207
1981  14879   17217   -637.0542  -229.1660
1982  14944   17418   -768.8790  -296.0910
1983  15656   17828   -458.3625   86.49899
1984  16343   19011   -929.7892  -205.6594
1985  17040   19476   -688.1302   111.3357
1986  17570   19906   -579.1982   251.5306
1987  17994   20072   -317.7500   479.8759
1988  18554   20740   -411.8743   405.8181
1989  18898   21120   -439.9809   366.8060
1990  19067   21281   -428.6367   347.8156
1991  18848   21109   -479.2094   243.0261
1992  19208   21548   -549.0905   195.0177
1993  19593   21493   -110.2330   588.3097
1994  20082   21812    66.39330   731.2551
1995  20382   22153    32.47656   660.7508
1996  20835   22546    100.6400   696.2055
1997  21365   23065    122.4207   689.6197
1998  22183   24131   -103.4364   474.3981
1999  23050   24564    339.5579   870.8977
2000  23862   25472    262.4189   751.2861
2001  24215   25697    395.0926   808.6041
2002  24632   26238    282.3303   639.0555
2003  25073   26566    402.1435   700.0686
2004  25750   27274    385.8501   633.1310
2005  26290   27403    799.5297   970.8717
2006  26835   28098    663.9663   760.6385
2007  27319   28614    642.6847   673.7335

Figure 8.2 CUSUM Plot of the Consumption Regression

8.3  Specification  Tests 

Specification  tests  are  an  important  part  of  model  specification  in  econometrics.  In  this  section, 
we  only  study  a  few  of  these  diagnostic  tests.  For  an  excellent  summary  on  this  topic,  see 
Wooldridge  (2001). 

(1)  Ramsey’s  (1969)  RESET  (Regression  Specification  Error  Test) 

Ramsey suggests testing the specification of the linear regression model $y_t = X_t'\beta + u_t$ by augmenting it with a set of regressors $Z_t$ so that the augmented model is

$$y_t = X_t'\beta + Z_t'\gamma + u_t \qquad (8.48)$$

If the $Z_t$'s are available then the specification test would reduce to the F-test for $H_0;\ \gamma = 0$. The crucial issue is the choice of $Z_t$ variables. This depends upon the true functional form under the alternative, which is usually unknown. However, this can often be well approximated by higher powers of the initial regressors, as in the case where the true form is quadratic or cubic. Alternatively, one might approximate it with higher moments of $\hat y_t = X_t'\hat\beta_{OLS}$. The popular Ramsey RESET test is carried out as follows:

(1) Regress $y_t$ on $X_t$ and get $\hat y_t$.

(2) Regress $y_t$ on $X_t$, $\hat y_t^2$, $\hat y_t^3$ and $\hat y_t^4$ and test that the coefficients of all the powers of $\hat y_t$ are zero. This is an $F_{3,T-k-3}$ under the null.

Note that $\hat y_t$ is not included among the regressors because it would be perfectly multicollinear with $X_t$.³ Different choices of $Z_t$'s may result in more powerful tests when $H_0$ is not true. Thursby and Schmidt (1977) carried out an extensive Monte Carlo and concluded that the test based on $Z_t = [X_t^2, X_t^3, X_t^4]$ seems to be generally the best choice.
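The two RESET steps translate directly into two least squares fits. The sketch below uses NumPy. The function names `ols_rss` and `reset_f` and the simulated data are hypothetical, and the F-statistic is returned without a p-value so that nothing beyond NumPy is needed.

```python
import numpy as np

def ols_rss(X, y):
    """Residual sum of squares and fitted values from regressing y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    return resid @ resid, fitted

def reset_f(X, y):
    """RESET F-statistic with y_hat^2, y_hat^3, y_hat^4 added; F(3, T-k-3) under the null."""
    T, k = X.shape
    rrss, y_hat = ols_rss(X, y)                                  # step (1)
    Z = np.column_stack([y_hat**2, y_hat**3, y_hat**4])
    urss, _ = ols_rss(np.column_stack([X, Z]), y)                # step (2)
    return ((rrss - urss) / 3) / (urss / (T - k - 3))

# Simulated illustration: the fitted model is linear in x in both cases.
rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=100)
X = np.column_stack([np.ones(100), x])
noise = rng.normal(scale=0.1, size=100)

f_linear_truth = reset_f(X, 1 + x + noise)            # model correctly specified
f_quadratic_truth = reset_f(X, 1 + x + x**2 + noise)  # omitted curvature
```

When the true relationship is quadratic, $\hat y_t^2$ picks up the omitted $x^2$ term and the statistic becomes very large. Under the correctly specified model it behaves like a draw from $F_{3,95}$.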
(2) Utts’ (1982) Rainbow Test

The basic idea behind the Rainbow test is that even when the true relationship is nonlinear, a good linear fit can still be obtained over subsets of the sample. The test therefore rejects the null hypothesis of linearity whenever the overall fit is markedly inferior to the fit over a properly selected sub-sample of the data, see Figure 8.3.

Figure 8.3 The Rainbow Test

Let $e'e$ be the OLS residual sum of squares from all available T observations and let $\tilde e'\tilde e$ be the OLS residual sum of squares from the middle half of the observations (T/2). Then

$$F = \frac{(e'e - \tilde e'\tilde e)/(T/2)}{\tilde e'\tilde e/(T/2 - k)} \quad \text{is distributed as } F_{T/2,\,(T/2 - k)} \text{ under } H_0 \qquad (8.49)$$

Under $H_0$; $E(e'e/(T - k)) = \sigma^2 = E[\tilde e'\tilde e/(T/2 - k)]$, while in general under $H_A$; $E(e'e/(T - k)) > E[\tilde e'\tilde e/(T/2 - k)] > \sigma^2$. The RRSS is $e'e$ because all the observations are forced to fit the straight line, whereas the URSS is $\tilde e'\tilde e$ because only a part of the observations are forced to fit a straight line. The crucial issue of the Rainbow test is the proper choice of the subsample (the middle T/2 observations in case of one regressor). This affects the power of the test and not the distribution of the test statistic under the null. Utts (1982) recommends points close to $\bar X$, since an incorrect linear fit will in general not be as far off there as it is in the outer region. Closeness to $\bar X$ is measured by the magnitude of the corresponding diagonal elements of $P_X$. Close points are those with low leverage $h_{ii}$, see section 8.1. The optimal size of the subset depends upon the alternative. Utts recommends about 1/2 of the data points in order to obtain some robustness to outliers. The F-test in (8.49) looks like a Chow test, but differs in the selection of the sub-sample. For example, using the post-sample predictive Chow test, the data are arranged according to time and the first $T_1$ observations are selected. The Rainbow test arranges the data according to their distance from $\bar X$ and selects the first T/2 of them.

(3) Plosser, Schwert and White (1982) (PSW) Differencing Test

The differencing test is a general test for misspecification (like Hausman’s (1978) test, which will be introduced in the simultaneous equation chapter) but for time-series data only. This test compares OLS and First Difference (FD) estimates of $\beta$. Let the differenced model be

$$\dot y = \dot X\beta + \dot u \qquad (8.50)$$


where $\dot y = Dy$, $\dot X = DX$ and $\dot u = Du$, and D is the familiar (T - 1) x T differencing matrix. When the regression contains a constant, the corresponding column of $\dot X$ is zero and is dropped. From (8.50), the FD estimator is given by

$$\hat\beta_{FD} = (\dot X'\dot X)^{-1}\dot X'\dot y \qquad (8.51)$$

with $\text{var}(\hat\beta_{FD}) = \sigma^2(\dot X'\dot X)^{-1}(\dot X'DD'\dot X)(\dot X'\dot X)^{-1}$ since $\text{var}(\dot u) = \sigma^2(DD')$, where $DD'$ is the (T - 1) x (T - 1) tridiagonal matrix with 2 on the diagonal and -1 on the adjacent off-diagonals. The differencing test is based on

$$\hat q = \hat\beta_{FD} - \hat\beta_{OLS} \quad \text{with} \quad V(\hat q) = V(\hat\beta_{FD}) - V(\hat\beta_{OLS}) \qquad (8.52)$$

A consistent estimate of $V(\hat q)$ is

$$\hat V(\hat q) = \hat\sigma^2\left[\left(\frac{\dot X'\dot X}{T}\right)^{-1}\left(\frac{\dot X'DD'\dot X}{T}\right)\left(\frac{\dot X'\dot X}{T}\right)^{-1} - \left(\frac{X'X}{T}\right)^{-1}\right] \qquad (8.53)$$

where $\hat\sigma^2$ is a consistent estimate of $\sigma^2$. Therefore,

$$\Delta = T\hat q'[\hat V(\hat q)]^{-1}\hat q \sim \chi^2_k \quad \text{under } H_0 \qquad (8.54)$$

where k is the number of slope parameters if $\hat V(\hat q)$ is nonsingular. $\hat V(\hat q)$ could be singular, in which case we use a generalized inverse $\hat V^-(\hat q)$ of $\hat V(\hat q)$ and in this case $\Delta$ is distributed as $\chi^2$ with degrees of freedom equal to the rank($\hat V(\hat q)$). This is a special case of the general Hausman (1978) test which will be studied extensively in Chapter 11.

Davidson, Godfrey, and MacKinnon (1985) show that, like the Hausman test, the PSW test is equivalent to a much simpler omitted variables test, the omitted variables being the sum of the lagged and one-period ahead values of the regressors.

Thus if the regression equation we are considering is

$$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + u_t \qquad (8.55)$$

the PSW test involves estimating the expanded regression equation

$$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \gamma_1 z_{1t} + \gamma_2 z_{2t} + u_t \qquad (8.56)$$

where $z_{1t} = x_{1,t+1} + x_{1,t-1}$ and $z_{2t} = x_{2,t+1} + x_{2,t-1}$ and testing the hypothesis $\gamma_1 = \gamma_2 = 0$ by the usual F-test.

If there are lagged dependent variables in the equation, the test needs a minor modification. Suppose that the model is

$$y_t = \beta_1 y_{t-1} + \beta_2 x_t + u_t \qquad (8.57)$$

Now the omitted variables would be defined as $z_{1t} = y_t + y_{t-2}$ and $z_{2t} = x_{t+1} + x_{t-1}$. There is no problem with $z_{2t}$ but $z_{1t}$ would be correlated with the error term $u_t$ because of the presence of $y_t$ in it. The solution would be simply to transfer it to the left hand side and write the expanded regression equation in (8.56) as

$$(1 - \gamma_1)y_t = \beta_1 y_{t-1} + \beta_2 x_t + \gamma_1 y_{t-2} + \gamma_2 z_{2t} + u_t \qquad (8.58)$$

This equation can be written as

$$y_t = \beta_1^* y_{t-1} + \beta_2^* x_t + \gamma_1^* y_{t-2} + \gamma_2^* z_{2t} + u_t^* \qquad (8.59)$$

where all the starred parameters are the corresponding unstarred ones divided by $(1 - \gamma_1)$.

The PSW now tests the hypothesis $\gamma_1^* = \gamma_2^* = 0$. Thus, in the case where the model involves the lagged dependent variable $y_{t-1}$ as an explanatory variable, the only modification needed is that we should use $y_{t-2}$ as the omitted variable, not $(y_t + y_{t-2})$. Note that it is only $y_{t-1}$ that creates a problem, not higher-order lags of $y_t$, like $y_{t-2}$, $y_{t-3}$, and so on. For $y_{t-2}$, the corresponding $z_t$ will be obtained by adding $y_{t-1}$ to $y_{t-3}$. This $z_t$ is not correlated with $u_t$ as long as the disturbances are not serially correlated.

(4)  Tests  for  Non-nested  Hypothesis 

Consider the following two competing non-nested models:

$H_1$: $y = X_1\beta_1 + \epsilon_1$    (8.60)

$H_2$: $y = X_2\beta_2 + \epsilon_2$    (8.61)

These are non-nested because the explanatory variables under one model are not a subset of the other model even though $X_1$ and $X_2$ may share some common variables. In order to test $H_1$ versus $H_2$, Cox (1961) modified the LR-test to allow for the non-nested case. The idea behind Cox's approach is to consider to what extent Model I, under $H_1$, is capable of predicting the performance of Model II, under $H_2$.

Alternatively, one can artificially nest the two models

$H_3$: $y = X_1\beta_1 + X_2^*\beta_2^* + \epsilon_3$    (8.62)

where $X_2^*$ excludes from $X_2$ the common variables with $X_1$. A test for $H_1$ is simply the F-test for $H_0$: $\beta_2^* = 0$.

Criticism: This tests $H_1$ versus $H_3$, which is a hybrid of $H_1$ and $H_2$, and not $H_1$ versus $H_2$. Davidson and MacKinnon (1981) proposed testing $\alpha = 0$ in the linear combination of $H_1$ and $H_2$:

$y = (1 - \alpha)X_1\beta_1 + \alpha X_2\beta_2 + \epsilon$    (8.63)

where $\alpha$ is an unknown scalar. Since $\alpha$ is not identified, we replace $\beta_2$ by $\hat\beta_{2,OLS} = (X_2'X_2/T)^{-1}(X_2'y/T)$, the regression coefficient estimate obtained from running $y$ on $X_2$ under $H_2$, i.e., (1) Run $y$ on $X_2$ and get $\hat y_2 = X_2\hat\beta_{2,OLS}$; (2) Run $y$ on $X_1$ and $\hat y_2$ and test that the coefficient of $\hat y_2$ is zero. This is known as the J-test and the resulting t-statistic is asymptotically N(0, 1) under $H_1$.
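The two steps of the J-test can be sketched as follows. This is a minimal illustration rather than a reference implementation: the names `ols` and `j_test` are hypothetical, and `X1` and `X2` are assumed to be full-column-rank regressor matrices (each may include a constant).

```python
import numpy as np

def ols(y, X):
    """OLS coefficients and residuals of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, y - X @ beta

def j_test(y, X1, X2):
    """t-statistic on the H2 fitted values when added to the H1 regression."""
    # Step 1: run y on X2 and keep the fitted values
    b2, _ = ols(y, X2)
    y2_hat = X2 @ b2
    # Step 2: run y on X1 and y2_hat, then test the coefficient of y2_hat
    W = np.column_stack([X1, y2_hat])
    b, resid = ols(y, W)
    s2 = resid @ resid / (len(y) - W.shape[1])
    cov = s2 * np.linalg.inv(W.T @ W)
    return b[-1] / np.sqrt(cov[-1, -1])
```

Calling `j_test(y, X2, X1)` reverses the roles of the two hypotheses, which the text below recommends doing as well.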

Fisher and McAleer (1981) suggested a modification of the J-test known as the JA test.

Under $H_1$: plim $\hat\beta_2$ = plim$(X_2'X_2/T)^{-1}$ plim$(X_2'X_1/T)\beta_1 + 0$    (8.64)

Therefore, they propose replacing $\hat\beta_2$ by $\tilde\beta_2 = (X_2'X_2)^{-1}(X_2'X_1)\hat\beta_{1,OLS}$ where $\hat\beta_{1,OLS} = (X_1'X_1)^{-1}X_1'y$. The steps for the JA-test are as follows:

1. Run $y$ on $X_1$ and get $\hat y_1 = X_1\hat\beta_{1,OLS}$.

2. Run $\hat y_1$ on $X_2$ and get $\tilde y_2 = X_2(X_2'X_2)^{-1}X_2'\hat y_1$.

3. Run $y$ on $X_1$ and $\tilde y_2$ and test that the coefficient of $\tilde y_2$ is zero. This is the simple t-statistic on the coefficient of $\tilde y_2$. The J and JA tests are asymptotically equivalent.
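The three JA steps above can be sketched in NumPy as follows. The function names `fitted` and `ja_test` are hypothetical, and `X1` and `X2` are assumed to be full-column-rank regressor matrices; this is a sketch under those assumptions only.

```python
import numpy as np

def fitted(X, v):
    """Fitted values from the OLS regression of v on X."""
    return X @ np.linalg.lstsq(X, v, rcond=None)[0]

def ja_test(y, X1, X2):
    """t-statistic on y2_tilde in the regression of y on X1 and y2_tilde."""
    y1_hat = fitted(X1, y)          # step 1: y on X1
    y2_tilde = fitted(X2, y1_hat)   # step 2: y1_hat on X2
    # step 3: y on X1 and y2_tilde, t-test on the last coefficient
    W = np.column_stack([X1, y2_tilde])
    b = np.linalg.lstsq(W, y, rcond=None)[0]
    resid = y - W @ b
    s2 = resid @ resid / (len(y) - W.shape[1])
    cov = s2 * np.linalg.inv(W.T @ W)
    return b[-1] / np.sqrt(cov[-1, -1])
```

The only difference from the J-test is step 2: the fitted values from $H_1$, rather than $y$ itself, are projected on $X_2$.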

Criticism: Note the asymmetry of $H_1$ and $H_2$. Therefore one should reverse the role of these hypotheses and test again.

In this case one can get the four scenarios depicted in Table 8.6. In case both hypotheses are not rejected, the data are not rich enough to discriminate between the two hypotheses. In case both hypotheses are rejected, neither model is useful in explaining the variation in $y$. In case one hypothesis is rejected while the other is not, one should remember that the non-rejected hypothesis may still be brought down by another challenger hypothesis.

Small Sample Properties: (i) The J-test tends to reject the null more frequently than it should. Also, the JA test has relatively low power when $K_1$, the number of parameters in $H_1$, is larger than $K_2$, the number of parameters in $H_2$. Therefore, one should use the JA test when $K_1$ is about the same size as $K_2$, i.e., the same number of non-overlapping variables. (ii) If both $H_1$ and $H_2$ are false, these tests are inferior to the standard diagnostic tests. In practice, use higher significance levels for the J-test, and supplement it with the artificially nested F-test and standard diagnostic tests.

Table 8.6 Non-nested Hypothesis Testing

                                      alpha = 0 (H1 as the null)
Reversed test (H2 as the null)    Not Rejected                  Rejected
Not Rejected                      Both H1 and H2                H1 rejected,
                                  are not rejected              H2 not rejected
Rejected                          H1 not rejected,              Both H1 and H2
                                  H2 rejected                   are rejected

Note: J and JA tests are one degree of freedom tests, whereas the artificially nested F-test is not.

For a recent summary of non-nested hypothesis testing, see Pesaran and Weeks (2001). Examples of non-nested hypotheses encountered in empirical economic research include linear versus log-linear models, see section 8.5. Also, logit versus probit models in discrete choice, see Chapter 13, and exponential versus Weibull distributions in the analysis of duration data. In the logit versus probit specification, the set of regressors is most likely to be the same. It is only the form of the distribution functions that separates the two models. Pesaran and Weeks (2001, p. 287) emphasize the differences between hypothesis testing and model selection:

The  model  selection  process  treats  all  models  under  consideration  symmetrically, 
while  hypothesis  testing  attributes  a  different  status  to  the  null  and  to  the  alternative 
hypotheses  and  by  design  treats  the  models  asymmetrically.  Model  selection  always 
ends  in  a  definite  outcome,  namely  one  of  the  models  under  consideration  is  selected 
for  use  in  decision  making.  Hypothesis  testing  on  the  other  hand  asks  whether  there 
is any statistically significant evidence (in the Neyman-Pearson sense) of departure
from  the  null  hypothesis  in  the  direction  of  one  or  more  alternative  hypotheses. 
Rejection  of  the  null  hypothesis  does  not  necessarily  imply  acceptance  of  any  one  of 
the  alternative  hypotheses;  it  only  warns  the  investigator  of  possible  shortcomings  of 
the  null  that  is  being  advocated.  Hypothesis  testing  does  not  seek  a  definite  outcome 
and  if  carried  out  with  due  care  need  not  lead  to  a  favorite  model.  For  example,  in  the 
case  of  nonnested  hypothesis  testing  it  is  possible  for  all  models  under  consideration 
to  be  rejected,  or  all  models  to  be  deemed  as  observationally  equivalent. 

They  conclude  that  the  choice  between  hypothesis  testing  and  model  selection  depends  on  the 
primary  objective  of  one’s  study.  Model  selection  may  be  more  appropriate  when  the  objective 
is  decision  making,  while  hypothesis  testing  is  better  suited  to  inferential  problems. 

A  model  may  be  empirically  adequate  for  a  particular  purpose,  but  of  little  relevance 
for  another  use...  In  the  real  world  where  the  truth  is  elusive  and  unknowable  both 
approaches  to  model  evaluation  are  worth  pursuing. 


(5)  White’s  (1982)  Information-Matrix  (IM)  Test 

This is a general specification test much like the Hausman (1978) specification test which will be considered in detail in Chapter 11. The latter is based on two different estimates of the regression coefficients, while the former is based on two different estimates of the Information Matrix $I(\theta)$ where $\theta' = (\beta', \sigma^2)$ in the case of the linear regression studied in Chapter 7. The first estimate of $I(\theta)$ evaluates the expectation of the second derivatives of the log-likelihood at the MLE, i.e., $-E(\partial^2 \log L/\partial\theta\partial\theta')$ at $\hat\theta_{MLE}$, while the second sums up the outer products of the score vectors $\sum_{i=1}^n (\partial \log L_i(\theta)/\partial\theta)(\partial \log L_i(\theta)/\partial\theta)'$ evaluated at $\hat\theta_{MLE}$. This is based on the fundamental identity that

$I(\theta) = -E(\partial^2 \log L/\partial\theta\partial\theta') = E[(\partial \log L/\partial\theta)(\partial \log L/\partial\theta)']$

If the model estimated by MLE is not correctly specified, this equality will not hold. From Chapter 7, equation (7.19), we know that for the linear regression model with normal disturbances, the first estimate of $I(\theta)$ denoted by $I_1(\hat\theta_{MLE})$ is given by

$I_1(\hat\theta_{MLE}) = \begin{bmatrix} X'X/\hat\sigma^2 & 0 \\ 0 & n/2\hat\sigma^4 \end{bmatrix}$    (8.65)

where $\hat\sigma^2 = e'e/n$ is the MLE of $\sigma^2$ and $e$ denotes the OLS residuals.



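Similarly, one can show that the second estimate of $I(\theta)$ denoted by $I_2(\theta)$ is given by

$I_2(\theta) = \sum_{i=1}^n \left(\frac{\partial \log L_i(\theta)}{\partial\theta}\right)\left(\frac{\partial \log L_i(\theta)}{\partial\theta}\right)' = \sum_{i=1}^n \begin{bmatrix} \frac{u_i^2 x_i x_i'}{\sigma^4} & -\frac{u_i x_i}{2\sigma^4} + \frac{u_i^3 x_i}{2\sigma^6} \\ -\frac{u_i x_i'}{2\sigma^4} + \frac{u_i^3 x_i'}{2\sigma^6} & \frac{1}{4\sigma^4} - \frac{u_i^2}{2\sigma^6} + \frac{u_i^4}{4\sigma^8} \end{bmatrix}$    (8.66)

where $x_i$ is the $i$-th row of $X$. Substituting the MLE we get

$I_2(\hat\theta_{MLE}) = \begin{bmatrix} \frac{\sum_{i=1}^n e_i^2 x_i x_i'}{\hat\sigma^4} & \frac{\sum_{i=1}^n e_i^3 x_i}{2\hat\sigma^6} \\ \frac{\sum_{i=1}^n e_i^3 x_i'}{2\hat\sigma^6} & -\frac{n}{4\hat\sigma^4} + \frac{\sum_{i=1}^n e_i^4}{4\hat\sigma^8} \end{bmatrix}$    (8.67)

where we used the fact that $\sum_{i=1}^n e_i x_i = 0$, together with $\sum_{i=1}^n e_i^2 = n\hat\sigma^2$. If the model is correctly specified and the disturbances are normal then

plim $I_1(\hat\theta_{MLE})/n$ = plim $I_2(\hat\theta_{MLE})/n$ = $I(\theta)$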


Therefore, the Information Matrix (IM) test rejects the model when

$[I_2(\hat\theta_{MLE}) - I_1(\hat\theta_{MLE})]/n$    (8.68)

is too large. These are two matrices with $(k + 1)$ by $(k + 1)$ elements since $\beta$ is $k \times 1$ and $\sigma^2$ is a scalar. However, due to symmetry, this reduces to $(k + 2)(k + 1)/2$ unique elements. Hall (1987) noted that the first $k(k + 1)/2$ unique elements obtained from the first $k \times k$ block of (8.68) have a typical element $\sum_{i=1}^n (e_i^2 - \hat\sigma^2)x_{ir}x_{is}/n\hat\sigma^4$ where $r$ and $s$ denote the $r$-th and $s$-th explanatory variables with $r, s = 1, 2, \ldots, k$. This term measures the discrepancy between the OLS estimates of the variance-covariance matrix of $\hat\beta_{OLS}$ and its robust counterpart suggested by White (1980), see Chapter 5. The next $k$ unique elements correspond to the off-diagonal block $\sum_{i=1}^n e_i^3 x_i/2n\hat\sigma^6$ and this measures the discrepancy between the estimates of the cov$(\hat\beta, \hat\sigma^2)$. The last element corresponds to the difference in the bottom right elements, i.e., the two estimates of the element corresponding to $\hat\sigma^2$. This is given by

$-\frac{3}{4\hat\sigma^4} + \frac{1}{n}\sum_{i=1}^n \frac{e_i^4}{4\hat\sigma^8}$
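As an illustration, the two information matrix estimates for the linear regression model, and their scaled difference in (8.68), can be computed directly from the OLS residuals. This is a sketch only: the function name `information_matrix_estimates` is hypothetical, and `X` is assumed to be an $n \times k$ regressor matrix of full column rank.

```python
import numpy as np

def information_matrix_estimates(y, X):
    """Return I1 (8.65), I2 (8.67) and D = (I2 - I1)/n for y = X beta + u."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta                  # OLS residuals
    s2 = e @ e / n                    # MLE of sigma^2

    # Hessian-based estimate: block diagonal in (beta, sigma^2)
    I1 = np.zeros((k + 1, k + 1))
    I1[:k, :k] = X.T @ X / s2
    I1[k, k] = n / (2.0 * s2**2)

    # Outer product of the scores, evaluated at the MLE
    score_beta = X * (e / s2)[:, None]                    # n x k
    score_s2 = -1.0 / (2.0 * s2) + e**2 / (2.0 * s2**2)   # length n
    S = np.column_stack([score_beta, score_s2])
    I2 = S.T @ S

    return I1, I2, (I2 - I1) / n
```

The bottom right element of the returned difference reproduces the last expression above, while its first $k \times k$ block holds the terms $\sum_{i=1}^n (e_i^2 - \hat\sigma^2)x_{ir}x_{is}/n\hat\sigma^4$.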


These $(k + 1)(k + 2)/2$ unique elements can be arranged in vector form $D(\theta)$ which has a limiting normal distribution with zero mean and some covariance matrix $V(\theta)$ under the null. One can show, see Hall (1987) or Kramer and Sonnberger (1986), that if $V(\theta)$ is estimated from the sample moments of these terms, the IM test statistic is given by

$m = nD'(\hat\theta)[\hat V(\hat\theta)]^{-1}D(\hat\theta) \overset{H_0}{\rightarrow} \chi^2_{(k+1)(k+2)/2}$    (8.69)

In fact, Hall (1987) shows that this statistic is the sum of three asymptotically independent terms

$m = m_1 + m_2 + m_3$    (8.70)

where $m_1$ = a particular version of White's heteroskedasticity test; $m_2$ = $n$ times the explained sum of squares from the regression of $e_i^3$ on $x_i$ divided by $6\hat\sigma^6$; and

$m_3 = \frac{n}{24\hat\sigma^8}\left(\sum_{i=1}^n e_i^4/n - 3\hat\sigma^4\right)^2$

which is similar to the Jarque-Bera test for normality of the disturbances given in Chapter 5.

It  is  clear  that  the  IM  test  will  have  power  whenever  the  disturbances  are  non-normal  or 
heteroskedastic.  However,  Davidson  and  MacKinnon  (1992)  demonstrated  that  the  IM  test 
considered  above  will  tend  to  reject  the  model  when  true,  much  too  often,  in  finite  samples.  This 
problem  gets  worse  as  the  number  of  degrees  of  freedom  gets  large.  In  Monte  Carlo  experiments, 
Davidson  and  MacKinnon  (1992)  showed  that  for  a  linear  regression  model  with  ten  regressors, 
the  IM  test  rejected  the  null  at  the  5%  level,  99.9%  of  the  time  for  n  =  200.  This  problem  did 
not  disappear  when  n  increased.  In  fact,  for  n  =  1000,  the  IM  test  still  rejected  the  null  92.7% 
of  the  time  at  the  5%  level. 

These  results  suggest  that  it  may  be  more  useful  to  run  individual  tests  for  non-normality, 
heteroskedasticity  and  other  misspecification  tests  considered  above  rather  than  run  the  IM 
test.  These  tests  may  be  more  powerful  and  more  informative  than  the  IM  test.  Alternative 
methods of calculating the IM test with better finite-sample properties are suggested in Orme (1990), Chesher and Spady (1991) and Davidson and MacKinnon (1992).

Example 3: For the consumption-income data given in Table 5.3, we first compute the RESET test from the consumption-income regression given in Chapter 5. Using EViews, one clicks on stability tests and then selects RESET. You will be prompted with the option of the number of fitted terms to include (i.e., powers of $\hat y$). Table 8.7 shows the RESET test including $\hat y^2$ and $\hat y^3$. The F-statistic for their joint significance is equal to 94.94. This is significant and indicates misspecification.


Table 8.7 Ramsey RESET Test

F-statistic             94.93796    Prob. F(2,45)          0.00000
Log likelihood ratio    80.96735    Prob. Chi-Square(2)    0.00000

Test Equation:
Dependent Variable: CONSUM
Method: Least Squares
Sample: 1959 2007
Included observations: 49

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C                 3519.599      1141.261       3.083956    0.0035
Y                 0.421587      0.173597       2.428540    0.0192
FITTED^2          1.99E-05      1.09E-05       1.834317    0.0732
FITTED^3         -1.18E-10      2.10E-10      -0.560377    0.5780

R-squared             0.998789    Mean dependent var       16749.10
Adjusted R-squared    0.998708    S.D. dependent var       5447.060
S.E. of regression    195.7648    Akaike info criterion    13.46981
Sum squared resid     1724573.    Schwarz criterion        13.62425
Log likelihood       -326.0104    Hannan-Quinn criter.     13.52840
F-statistic           12372.26    Durbin-Watson stat       1.001605
Prob(F-statistic)     0.000000

Table 8.8 Consumption Regression 1971-1995

Dependent Variable: CONSUM
Method: Least Squares
Sample: 1971 1995
Included observations: 25

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C                -1410.425      371.3812      -3.797783    0.0009
Y                 0.963780      0.020036       48.10199    0.0000

R-squared             0.990157    Mean dependent var       16279.48
Adjusted R-squared    0.989730    S.D. dependent var       2553.097
S.E. of regression    258.7391    Akaike info criterion    14.02614
Sum squared resid     1539756.    Schwarz criterion        14.12365
Log likelihood       -173.3267    Hannan-Quinn criter.     14.05318
F-statistic           2313.802    Durbin-Watson stat       0.613064
Prob(F-statistic)     0.000000

Next, we compute Utts (1982) Rainbow test. Table 8.8 gives the middle 25 observations of our data, i.e., 1971-1995, and the EViews 6 regression using this data. The RSS of these middle observations is given by $\tilde e'\tilde e = 1539756.14$, while the RSS for the entire sample is given by $e'e = 9001347.76$ so that the observed F-statistic given in (8.49) can be computed as follows:

$F = \frac{(9001347.76 - 1539756.14)/25}{1539756.14/23} = 4.46$

This is distributed as $F_{25,23}$ under the null hypothesis and rejects the hypothesis of linearity.

The PSW differencing test is computed using the artificial regression given in (8.56) with $Z_t = Y_{t+1} + Y_{t-1}$. The results are given in Table 8.9 using EViews 6. The t-statistic for $Z_t$ is 1.19 and has a p-value of 0.24 which is insignificant.

Now consider the two competing non-nested models:

$H_1$: $C_t = \beta_0 + \beta_1 Y_t + \beta_2 Y_{t-1} + u_t$        $H_2$: $C_t = \gamma_0 + \gamma_1 Y_t + \gamma_2 C_{t-1} + v_t$

The two non-nested models share $Y_t$ as a common variable. The artificial model that nests these two models is given by:

$H_3$: $C_t = \delta_0 + \delta_1 Y_t + \delta_2 Y_{t-1} + \delta_3 C_{t-1} + \epsilon_t$

Table 8.10 runs regression (1) given by $H_2$ and obtains the predicted values $\hat C_2$ (C2HAT). Regression (2) runs consumption on a constant, income, lagged income and C2HAT. The coefficient of this last variable is 1.18 and is statistically significant with a t-value of 16.99. This is the Davidson and MacKinnon (1981) J-test. In this case, $H_1$ is rejected but $H_2$ is not rejected. The JA-test, given by Fisher and McAleer (1981), runs the regression in $H_1$ and keeps the predicted values $\hat C_1$ (C1HAT). This is done in regression (3). Then C1HAT is run on a constant, income and lagged consumption and the predicted values are stored as $\tilde C_2$ (C2TILDE). This is done in regression (5). The last step runs consumption on a constant, income, lagged income and C2TILDE, see regression (6). The coefficient of this last variable is 97.43 and is statistically significant with a t-value of 16.99. Again $H_1$ is rejected but $H_2$ is not rejected.

Table 8.9 Artificial Regression to compute the PSW Differencing Test

Dependent Variable: CONSUM
Method: Least Squares
Sample (adjusted): 1960 2006
Included observations: 47 after adjustments

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C                -1373.390      226.1376      -6.073251    0.0000
Y                 0.596293      0.321464       1.854930    0.0703
Z                 0.191494      0.160960       1.189700    0.2405

R-squared             0.993678    Mean dependent var       16693.85
Adjusted R-squared    0.993390    S.D. dependent var       5210.244
S.E. of regression    423.5942    Akaike info criterion    14.99713
Sum squared resid     7895011.    Schwarz criterion        15.11523
Log likelihood       -349.4326    Hannan-Quinn criter.     15.04157
F-statistic           3457.717    Durbin-Watson stat       0.119325
Prob(F-statistic)     0.000000

Reversing the roles of $H_1$ and $H_2$, the J and JA-tests are repeated. In fact, regression (4) runs consumption on a constant, income, lagged consumption and $\hat C_1$ (which was obtained from regression (3)). The coefficient on $\hat C_1$ is -15.20 and is statistically significant with a t-value of -6.5. This J-test rejects $H_2$ but does not reject $H_1$. Regression (7) runs $\hat C_2$ on a constant, income and lagged income and the predicted values are stored as $\tilde C_1$ (C1TILDE). The last step of the JA test runs consumption on a constant, income, lagged consumption and $\tilde C_1$, see regression (8). The coefficient of this last variable is -1.11 and is statistically significant with a t-value of -6.5. This JA test rejects $H_2$ but not $H_1$. The artificial model, given in $H_3$, is also estimated, see regression (9). One can easily check that the corresponding F-tests reject $H_1$ against $H_3$ and also $H_2$ against $H_3$. In sum, all evidence indicates that both $C_{t-1}$ and $Y_{t-1}$ are important to include along with $Y_t$. Of course, the true model is not known and could include higher lags of both $Y_t$ and $C_t$.

Stata  11  performs  White’s  (1982)  Information  matrix  test  by  issuing  the  command  estat 
imtest  after  running  the  regression  of  consumption  on  income.  The  results  yield: 


. estat imtest

Cameron & Trivedi's decomposition of IM-test

              Source |     chi2     df        p
  Heteroskedasticity |     2.64      2   0.2677
            Skewness |     0.45      1   0.5030
            Kurtosis |     4.40      1   0.0359
               Total |     7.48      4   0.1124

This does not reject the null even though Kurtosis seems to be a problem. Note that the IM test is split into its components following Hall (1987) as described above.

Table 8.10 Non-nested J and JA Tests for the Consumption Regression

Regression 1
Dependent Variable: CONSUM
Method: Least Squares
Sample (adjusted): 1960 2007
Included observations: 48 after adjustments

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C                -254.5241      155.2906      -1.639019    0.1082
Y                 0.211505      0.068310       3.096256    0.0034
CONSUM(-1)        0.800004      0.070537       11.34159    0.0000

R-squared             0.998367    Mean dependent var       16915.21
Adjusted R-squared    0.998294    S.D. dependent var       5377.825
S.E. of regression    222.1108    Akaike info criterion    13.70469
Sum squared resid     2219995.    Schwarz criterion        13.82164
Log likelihood       -325.9126    Hannan-Quinn criter.     13.74889
F-statistic           13754.09    Durbin-Watson stat       0.969327
Prob(F-statistic)     0.000000

Regression 2
Dependent Variable: CONSUM
Method: Least Squares
Sample (adjusted): 1960 2007
Included observations: 48 after adjustments

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C                 144.3306      125.5929       1.149194    0.2567
Y                 0.425354      0.090692       4.690091    0.0000
Y(-1)            -0.613631      0.094424      -6.498678    0.0000
C2HAT             1.184853      0.069757       16.98553    0.0000

R-squared             0.999167    Mean dependent var       16915.21
Adjusted R-squared    0.999110    S.D. dependent var       5377.825
S.E. of regression    160.4500    Akaike info criterion    13.07350
Sum squared resid     1132745.    Schwarz criterion        13.22943
Log likelihood       -309.7639    Hannan-Quinn criter.     13.13242
F-statistic           17585.25    Durbin-Watson stat       1.971939
Prob(F-statistic)     0.000000

Table 8.10 (continued)

Regression 3
Dependent Variable: CONSUM
Method: Least Squares
Sample (adjusted): 1960 2007
Included observations: 48 after adjustments

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C                -1424.802      231.2843      -6.160393    0.0000
Y                 0.943371      0.232170       4.063283    0.0002
Y(-1)             0.040368      0.234363       0.172244    0.8640

R-squared             0.993702    Mean dependent var       16915.21
Adjusted R-squared    0.993423    S.D. dependent var       5377.825
S.E. of regression    436.1488    Akaike info criterion    15.05431
Sum squared resid     8560159.    Schwarz criterion        15.17126
Log likelihood       -358.3033    Hannan-Quinn criter.     15.09850
F-statistic           3550.327    Durbin-Watson stat       0.174411
Prob(F-statistic)     0.000000

Regression 4
Dependent Variable: CONSUM
Method: Least Squares
Sample (adjusted): 1960 2007
Included observations: 48 after adjustments

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C                -21815.80      3319.691      -6.571637    0.0000
Y                 15.01623      2.278648       6.589974    0.0000
CONSUM(-1)        0.947887      0.055806       16.98553    0.0000
C1HAT            -15.20110      2.339106      -6.498678    0.0000

R-squared             0.999167    Mean dependent var       16915.21
Adjusted R-squared    0.999110    S.D. dependent var       5377.825
S.E. of regression    160.4500    Akaike info criterion    13.07350
Sum squared resid     1132745.    Schwarz criterion        13.22943
Log likelihood       -309.7639    Hannan-Quinn criter.     13.13242
F-statistic           17585.25    Durbin-Watson stat       1.971939
Prob(F-statistic)     0.000000

Regression 5
Dependent Variable: C1HAT
Method: Least Squares
Sample (adjusted): 1960 2007
Included observations: 48 after adjustments

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C                -1418.403      7.149223      -198.3996    0.0000
Y                 0.973925      0.003145       309.6905    0.0000
CONSUM(-1)        0.009728      0.003247       2.995785    0.0044

R-squared             0.999997    Mean dependent var       16915.21
Adjusted R-squared    0.999996    S.D. dependent var       5360.865
S.E. of regression    10.22548    Akaike info criterion    7.548103
Sum squared resid     4705.215    Schwarz criterion        7.665053
Log likelihood       -178.1545    Hannan-Quinn criter.     7.592298
F-statistic           6459057.    Durbin-Watson stat       1.678118
Prob(F-statistic)     0.000000

Regression 6
Dependent Variable: CONSUM
Method: Least Squares
Sample (adjusted): 1960 2007
Included observations: 48 after adjustments

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C                 138044.4      8211.501       16.81111    0.0000
Y                -94.21814      5.603155      -16.81519    0.0000
Y(-1)            -0.613631      0.094424      -6.498678    0.0000
C2TILDE           97.43471      5.736336       16.98553    0.0000

R-squared             0.999167    Mean dependent var       16915.21
Adjusted R-squared    0.999110    S.D. dependent var       5377.825
S.E. of regression    160.4500    Akaike info criterion    13.07350
Sum squared resid     1132745.    Schwarz criterion        13.22943
Log likelihood       -309.7639    Hannan-Quinn criter.     13.13242
F-statistic           17585.25    Durbin-Watson stat       1.971939
Prob(F-statistic)     0.000000

Regression 7
Dependent Variable: C2HAT
Method: Least Squares
Sample (adjusted): 1960 2007
Included observations: 48 after adjustments

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C                -1324.328      181.8276      -7.283424    0.0000
Y                 0.437200      0.182524       2.395306    0.0208
Y(-1)             0.551966      0.184248       2.995785    0.0044

R-squared             0.996101    Mean dependent var       16915.21
Adjusted R-squared    0.995928    S.D. dependent var       5373.432
S.E. of regression    342.8848    Akaike info criterion    14.57313
Sum squared resid     5290650.    Schwarz criterion        14.69008
Log likelihood       -346.7551    Hannan-Quinn criter.     14.61732
F-statistic           5748.817    Durbin-Watson stat       0.127201
Prob(F-statistic)     0.000000

Regression 8
Dependent Variable: CONSUM
Method: Least Squares
Sample (adjusted): 1960 2007
Included observations: 48 after adjustments

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C                -1629.522      239.4806      -6.804403    0.0000
Y                 1.161999      0.154360       7.527865    0.0000
CONSUM(-1)        0.947887      0.055806       16.98553    0.0000
C1TILDE          -1.111718      0.171068      -6.498678    0.0000

R-squared             0.999167    Mean dependent var       16915.21
Adjusted R-squared    0.999110    S.D. dependent var       5377.825
S.E. of regression    160.4500    Akaike info criterion    13.07350
Sum squared resid     1132745.    Schwarz criterion        13.22943
Log likelihood       -309.7639    Hannan-Quinn criter.     13.13242
F-statistic           17585.25    Durbin-Watson stat       1.971939
Prob(F-statistic)     0.000000

Regression 9
Dependent Variable: CONSUM
Method: Least Squares
Sample (adjusted): 1960 2007
Included observations: 48 after adjustments

Variable       Coefficient    Std. Error    t-Statistic    Prob.
C                -157.2430      113.1743      -1.389389    0.1717
Y                 0.675956      0.086849       7.783091    0.0000
Y(-1)            -0.613631      0.094424      -6.498678    0.0000
CONSUM(-1)        0.947887      0.055806       16.98553    0.0000

R-squared             0.999167    Mean dependent var       16915.21
Adjusted R-squared    0.999110    S.D. dependent var       5377.825
S.E. of regression    160.4500    Akaike info criterion    13.07350
Sum squared resid     1132745.    Schwarz criterion        13.22943
Log likelihood       -309.7639    Hannan-Quinn criter.     13.13242
F-statistic           17585.25    Durbin-Watson stat       1.971939
Prob(F-statistic)     0.000000

8.4 Nonlinear Least Squares and the Gauss-Newton Regression⁴

So far we have been dealing with linear regressions. But, in reality, one might face a nonlinear regression of the form:

$y_t = x_t(\beta) + u_t$ for $t = 1, 2, \ldots, T$    (8.71)

where $u_t \sim$ IID$(0, \sigma^2)$ and $x_t(\beta)$ is a scalar nonlinear regression function of $k$ unknown parameters $\beta$. It can be interpreted as the expected value of $y_t$ conditional on the values of the independent variables. Nonlinear least squares minimizes $\sum_{t=1}^T (y_t - x_t(\beta))^2 = (y - x(\beta))'(y - x(\beta))$. The first-order conditions for minimization yield

$X'(\hat\beta)(y - x(\hat\beta)) = 0$    (8.72)

where $X(\beta)$ is a $T \times k$ matrix with typical element $X_{tj}(\beta) = \partial x_t(\beta)/\partial\beta_j$ for $j = 1, \ldots, k$. The solution to these $k$ equations yields the Nonlinear Least Squares (NLS) estimates of $\beta$ denoted by $\hat\beta_{NLS}$. These normal equations given in (8.72) are similar to those in the linear case in that they require the vector of residuals $y - x(\hat\beta)$ to be orthogonal to the matrix of derivatives $X(\hat\beta)$. In the linear case, $x(\hat\beta) = X\hat\beta_{OLS}$ and $X(\hat\beta) = X$ where the latter is independent of $\hat\beta$. Because of this dependence of the fitted values $x(\hat\beta)$ as well as the matrix of derivatives $X(\hat\beta)$ on $\hat\beta$, one in general cannot get an explicit analytical solution to these NLS first-order equations. Under fairly general conditions, see Davidson and MacKinnon (1993), one can show that $\hat\beta_{NLS}$ has asymptotically a normal distribution with mean $\beta_0$ and asymptotic variance $\sigma_0^2(X'(\beta_0)X(\beta_0))^{-1}$, where $\beta_0$ and $\sigma_0$ are the true values of the parameters generating the data. Similarly, defining

$s^2 = (y - x(\hat\beta_{NLS}))'(y - x(\hat\beta_{NLS}))/(T - k)$

we get a feasible estimate of this covariance matrix as $s^2(X'(\hat\beta)X(\hat\beta))^{-1}$. If the disturbances are normally distributed then NLS is MLE and therefore asymptotically efficient as long as the model is correctly specified, see Chapter 7.

Taking  the  first-order  Taylor  series  approximation  around  some  arbitrary  parameter  vector  /3*, 
we  get 

y  =  x(/3*)  +  X(/3*)((3  —  (3*)  +  higher-order  terms  +  u  (8.73) 

or 

y  —  x(/3*)  =  X(/3*)b  +  residuals  (8.74) 

This  is  the  simplest  version  of  the  Gauss-Newton  Regression,  see  Davidson  and  MacKinnon 
(1993).  In  this  case  the  higher-order  terms  and  the  error  term  are  combined  in  the  residuals 
and  (/?  —  /?*)  is  replaced  by  b,  a  parameter  vector  that  can  be  estimated.  If  the  model  is  linear, 
X(/3*)  is  the  matrix  of  regressors  X  and  the  GNR  regresses  a  residual  on  X.  If  (3*=/3nlsi  the 
unrestricted  NLS  estimator  of  /3,  then  the  GNR  becomes 

y  —  x  =  Xb  +  residuals  (8.75) 

where  x  =  x(f3NLS)  and  X  =  X((3NLS).  From  the  first-order  conditions  of  NLS  we  get  (y  — 
x)'X  =  0.  In  this  case,  OLS  on  this  GNR  yields  boLS  =  ( X'X)~1X'{y  —  x)  =  0  and  this  GNR 
has  no  explanatory  power.  However,  this  regression  can  be  used  to  (i)  check  that  the  first- 
order  conditions  given  in  (8.72)  are  satisfied.  For  example,  one  could  check  that  the  t-statistics 
are  of  the  10~3  order,  and  that  R2  is  zero  up  to  several  decimal  places;  (ii)  compute  estimated 
covariance  matrices.  In  fact,  this  GNR  prints  out  s2(X'X)~1,  where  s2  =  (y  —  x)' (y  —  x) / (T  —  k) 
is  the  OLS  estimate  of  the  regression  variance.  This  can  be  verified  easily  using  the  fact  that  this 
GNR  has  no  explanatory  power.  This  method  of  computing  the  estimated  variance-covariance 
matrix  is  useful  especially  in  cases  where  (3  has  been  obtained  by  some  method  other  than  NLS. 
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For  example,  sometimes  the  model  is  nonlinear  only  in  one  or  two  parameters  which  are  known 
to  be  in  a  finite  range,  say  between  zero  and  one.  One  can  then  search  over  this  range,  running 
OLS  regressions  and  minimizing  the  residual  sum  of  squares.  This  search  procedure  can  be 
repeated  over  finer  grids  to  get  more  accuracy.  Once  the  final  parameter  estimate  is  found,  one 
can  run  the  GNR  to  get  estimates  of  the  variance-covariance  matrix. 
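
The search-then-GNR procedure just described can be sketched in Python. This is only an illustration under stated assumptions, not code from the text: the model $y_t = \beta_1 + \beta_2 x_t^{\gamma} + u_t$, the simulated data, the grid sizes and the helper names `ols` and `grid_search` are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
x = rng.uniform(1.0, 10.0, T)
# Hypothetical data-generating process: nonlinear only in gamma (true value 0.5)
y = 1.0 + 2.0 * x**0.5 + rng.normal(0.0, 0.05, T)


def ols(W, v):
    """Return OLS coefficients and residuals from regressing v on W."""
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    return coef, v - W @ coef


def grid_search(lo, hi, n_points=96):
    """For each gamma on the grid run OLS and keep the smallest RSS."""
    best = None
    for g in np.linspace(lo, hi, n_points):
        beta, resid = ols(np.column_stack([np.ones(T), x**g]), y)
        rss = resid @ resid
        if best is None or rss < best[0]:
            best = (rss, g, beta)
    return best


# Coarse grid over (0, 1], then a finer grid around the coarse minimizer
_, g0, _ = grid_search(0.05, 1.0)
rss, gamma_hat, beta_hat = grid_search(max(g0 - 0.01, 0.01), min(g0 + 0.01, 1.0), 201)

# GNR at the final estimates: regress the residuals on the derivative matrix
x_hat = beta_hat[0] + beta_hat[1] * x**gamma_hat
X_hat = np.column_stack([
    np.ones(T),                               # derivative w.r.t. beta_1
    x**gamma_hat,                             # derivative w.r.t. beta_2
    beta_hat[1] * x**gamma_hat * np.log(x),   # derivative w.r.t. gamma
])
u_hat = y - x_hat
b_gnr, gnr_resid = ols(X_hat, u_hat)
k = X_hat.shape[1]
s2 = gnr_resid @ gnr_resid / (T - k)
cov_hat = s2 * np.linalg.inv(X_hat.T @ X_hat)   # s^2 (X_hat' X_hat)^{-1}
std_err = np.sqrt(np.diag(cov_hat))
print("gamma_hat:", gamma_hat)
print("GNR coefficients:", b_gnr)
print("standard errors:", std_err)
```

Because the grid only approximates the NLS minimizer, the GNR coefficient on the $\gamma$ derivative is close to, but not exactly, zero; the coefficients tied to $\beta_1$ and $\beta_2$ satisfy their first-order conditions exactly since they come from OLS at the chosen $\gamma$.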

Testing  Restrictions  (GNR  Based  on  the  Restricted  NLS  Estimates) 

The  best  known  use  for  the  GNR  is  to  test  restrictions.  These  are  based  on  the  LM  principle 
which  requires  only  the  restricted  estimator.  In  particular,  consider  the  following  competing 
hypotheses: 

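$H_0$; $y = x(\beta_1, 0) + u$    $H_1$; $y = x(\beta_1, \beta_2) + u$

where $u \sim \text{IID}(0, \sigma^2 I)$ and $\beta_1$ and $\beta_2$ are $k \times 1$ and $r \times 1$, respectively. Denote by $\tilde{\beta}$ the restricted
NLS estimator of $\beta$, in this case $\tilde{\beta}' = (\tilde{\beta}_1', 0)$.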

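The GNR evaluated at this restricted NLS estimator of $\beta$ is

$(y - \tilde{x}) = \tilde{X}_1 b_1 + \tilde{X}_2 b_2 + \text{residuals}$  (8.76)

where $\tilde{x} = x(\tilde{\beta})$ and $\tilde{X}_i = X_i(\tilde{\beta})$ with $X_i(\beta) = \partial x/\partial\beta_i$ for $i = 1, 2$.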

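By the FWL Theorem this yields the same estimate of $b_2$ as

$\bar{P}_{\tilde{X}_1}(y - \tilde{x}) = \bar{P}_{\tilde{X}_1}\tilde{X}_2 b_2 + \text{residuals}$  (8.77)

But $\bar{P}_{\tilde{X}_1}(y - \tilde{x}) = (y - \tilde{x}) - P_{\tilde{X}_1}(y - \tilde{x}) = (y - \tilde{x})$ since $\tilde{X}_1'(y - \tilde{x}) = 0$ from the first-order
conditions of restricted NLS. Hence, (8.77) reduces to

$(y - \tilde{x}) = \bar{P}_{\tilde{X}_1}\tilde{X}_2 b_2 + \text{residuals}$  (8.78)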

Therefore, 

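$\hat{b}_{2,OLS} = (\tilde{X}_2'\bar{P}_{\tilde{X}_1}\tilde{X}_2)^{-1}\tilde{X}_2'\bar{P}_{\tilde{X}_1}(y - \tilde{x}) = (\tilde{X}_2'\bar{P}_{\tilde{X}_1}\tilde{X}_2)^{-1}\tilde{X}_2'(y - \tilde{x})$  (8.79)

and the residual sum of squares is $(y - \tilde{x})'(y - \tilde{x}) - (y - \tilde{x})'\tilde{X}_2(\tilde{X}_2'\bar{P}_{\tilde{X}_1}\tilde{X}_2)^{-1}\tilde{X}_2'(y - \tilde{x})$.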

If $\tilde{X}_2$ was excluded from the regression in (8.76), $(y - \tilde{x})'(y - \tilde{x})$ would be the residual sum of
squares. Therefore, the reduction in the residual sum of squares brought about by the inclusion
of $\tilde{X}_2$ is

$(y - \tilde{x})'\tilde{X}_2(\tilde{X}_2'\bar{P}_{\tilde{X}_1}\tilde{X}_2)^{-1}\tilde{X}_2'(y - \tilde{x})$

This is also equal to the explained sum of squares from (8.76) since $\tilde{X}_1$ has no explanatory
power. This sum of squares divided by a consistent estimate of $\sigma^2$ is asymptotically distributed
as $\chi^2_r$ under the null.

Different consistent estimates of $\sigma^2$ yield different test statistics. The two most common
test statistics for $H_0$ based on this regression are the following: (1) $TR_u^2$ where $R_u^2$ is the
uncentered $R^2$ of (8.76) and (2) the F-statistic for $b_2 = 0$. The first statistic is given by $TR_u^2 =
T(y - \tilde{x})'\tilde{X}_2(\tilde{X}_2'\bar{P}_{\tilde{X}_1}\tilde{X}_2)^{-1}\tilde{X}_2'(y - \tilde{x})/(y - \tilde{x})'(y - \tilde{x})$ where the uncentered $R^2$ was defined in the
Appendix to Chapter 3. This statistic implicitly divides the explained sum of squares term by
$\tilde{\sigma}^2$ = (restricted residual sum of squares)/$T$. This is equivalent to the LM-statistic obtained
by running the artificial regression $(y - \tilde{x})/\tilde{\sigma}$ on $\tilde{X}$ and getting the explained sum of squares.
Regression packages print the centered $R^2$. This is equal to the uncentered $R_u^2$ as long as there
is a constant in the restricted regression so that the elements of $(y - \tilde{x})$ sum to zero.

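The F-statistic for $b_2 = 0$ from (8.76) is

$F = \frac{(RRSS - URSS)/r}{URSS/(T - k)} = \frac{(y - \tilde{x})'\tilde{X}_2(\tilde{X}_2'\bar{P}_{\tilde{X}_1}\tilde{X}_2)^{-1}\tilde{X}_2'(y - \tilde{x})/r}{[(y - \tilde{x})'(y - \tilde{x}) - (y - \tilde{x})'\tilde{X}_2(\tilde{X}_2'\bar{P}_{\tilde{X}_1}\tilde{X}_2)^{-1}\tilde{X}_2'(y - \tilde{x})]/(T - k)}$  (8.80)

The denominator is the OLS estimate of $\sigma^2$ from (8.76) which tends to $\sigma_0^2$ as $T \to \infty$. Hence
($r$F-statistic $\to \chi^2_r$). In small samples, use the F-statistic.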

Diagnostic  Tests  for  Linear  Regression  Models 

Variable addition tests suggested by Pagan and Hall (1983) consider the additional variables
$Z$ of dimension $(T \times r)$ and test whether their coefficients are zero using an F-test from the
regression

$y = X\beta + Z\gamma + u$  (8.81)


If $H_0$; $\gamma = 0$ is true, the model is $y = X\beta + u$ and there is no misspecification. The GNR for
this restriction would run the following regression:

$\bar{P}_X y = Xb + Zc + \text{residuals}$  (8.82)

and test that $c$ is zero. By the FWL Theorem, (8.82) yields the same residual sum of squares as

$\bar{P}_X y = \bar{P}_X Zc + \text{residuals}$  (8.83)


Applying the FWL Theorem to (8.81) we get the same residual sum of squares as the regression
in (8.83). The F-statistic for $\gamma = 0$ from (8.81) is therefore identical to the F-statistic for $c = 0$
from the GNR given in (8.82). Hence, “Tests based on the GNR are equivalent to variable
addition tests when the latter are applicable,” see Davidson and MacKinnon (1993, p. 194).

Note also that the $nR_u^2$ test statistic for $H_0$; $\gamma = 0$ based on the GNR in (8.82) is exactly the
LM statistic obtained by regressing the restricted least squares residuals of $y$ on $X$ on the unrestricted
set of regressors $X$ and $Z$ in (8.81). If $X$ has a constant, then the uncentered $R_u^2$ is equal to
the centered $R^2$ printed by the regression.

Computational Warning: It is tempting to base tests on the OLS residuals $\tilde{u} = \bar{P}_X y$ by
simply regressing them on the test regressors $Z$. This is equivalent to running the GNR without
the $X$ variables on the right hand side of (8.82), yielding test statistics that are too small.
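
The equivalence between the variable addition test and the GNR, and the computational warning, can be checked numerically. The following Python sketch uses simulated data; the sample size, coefficients and helper names (`rss`, `f_stat`) are assumptions made for illustration, and the degrees of freedom are taken as $T - k - r$ because the augmented regressions here contain $k + r$ regressors.

```python
import numpy as np

rng = np.random.default_rng(1)
T, k, r = 100, 3, 2
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
# Test regressors Z, made correlated with X so that omitting X matters
Z = rng.normal(size=(T, r)) + 0.8 * X[:, [1]]
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=T)   # the null model is true


def rss(W, v):
    """Residual sum of squares from regressing v on W by OLS."""
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    e = v - W @ coef
    return e @ e


def f_stat(rrss, urss, n_restr, dof):
    """F-statistic from restricted and unrestricted residual sums of squares."""
    return ((rrss - urss) / n_restr) / (urss / dof)


XZ = np.column_stack([X, Z])
dof = T - k - r

# Variable addition test: F for gamma = 0 in y = X beta + Z gamma + u
F_va = f_stat(rss(X, y), rss(XZ, y), r, dof)

# GNR: regress the restricted OLS residuals on X and Z, F for c = 0
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ coef
F_gnr = f_stat(u @ u, rss(XZ, u), r, dof)

# T times the uncentered R^2, with and without X on the right hand side
TR2 = T * (1.0 - rss(XZ, u) / (u @ u))
TR2_naive = T * (1.0 - rss(Z, u) / (u @ u))
print("F (variable addition):", F_va)
print("F (GNR):", F_gnr)
print("T R^2 with X and Z:", TR2)
print("T R^2 with Z only:", TR2_naive)
```

The two F-statistics coincide, while the statistic from regressing the residuals on $Z$ alone can never exceed the one from the full GNR, since dropping $X$ can only raise the residual sum of squares.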


Functional  Form 

Davidson and MacKinnon (1993, p. 195) show that the RESET with $y_t = X_t\beta + \hat{y}_t^2 c + \text{residual}$,
which is based on testing for $c = 0$, is equivalent to testing for $\theta = 0$ using the nonlinear model
$y_t = X_t\beta(1 + \theta X_t\beta) + u_t$. In this case, it is easy to verify from (8.74) that the GNR is

$y_t - X_t\tilde{\beta}(1 + \tilde{\theta}X_t\tilde{\beta}) = (2\tilde{\theta}(X_t\tilde{\beta})X_t + X_t)b + (X_t\tilde{\beta})^2 c + \text{residual}$

At $\tilde{\theta} = 0$ and $\tilde{\beta} = \hat{\beta}_{OLS}$, the GNR becomes $(y_t - X_t\hat{\beta}_{OLS}) = X_t b + (X_t\hat{\beta}_{OLS})^2 c + \text{residual}$. The
t-statistic on $c = 0$ is equivalent to that from the RESET regression given in section 8.3, see
problem 25.

Testing  for  Serial  Correlation 

Suppose  that  the  null  hypothesis  is  the  nonlinear  regression  model  given  in  (8.71),  and  the 
alternative is the model $y_t = x_t(\beta) + \nu_t$ with $\nu_t = \rho\nu_{t-1} + u_t$ where $u_t \sim \text{IID}(0, \sigma^2)$. Conditional
on the first observation, the alternative model can be written as

$y_t = x_t(\beta) + \rho(y_{t-1} - x_{t-1}(\beta)) + u_t$

The GNR test for $H_0$; $\rho = 0$, computes the derivatives of this regression function with respect
to $\beta$ and $\rho$ evaluated at the restricted estimates under the null hypothesis, i.e., $\rho = 0$ and
$\beta = \hat{\beta}_{NLS}$ (the nonlinear least squares estimate of $\beta$ assuming no serial correlation). Those yield
$X_t(\hat{\beta}_{NLS})$ and $(y_{t-1} - x_{t-1}(\hat{\beta}_{NLS}))$ respectively. Therefore, the GNR runs $\hat{u}_t = y_t - x_t(\hat{\beta}_{NLS}) =
X_t(\hat{\beta}_{NLS})b + c\hat{u}_{t-1} + \text{residual}$, and tests that $c = 0$. If the regression model is linear, this reduces
to  running  ordinary  least  squares  residuals  on  their  lagged  values  in  addition  to  the  regressors 
in  the  model.  This  is  exactly  the  Breusch  and  Godfrey  test  for  first-order  serial  correlation 
considered  in  Chapter  5.  For  other  applications  as  well  as  benefits  and  limitations  of  the  GNR, 
see  Davidson  and  MacKinnon  (1993). 
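
For the linear case, the regression of OLS residuals on the regressors and the lagged residual can be sketched as follows. The simulated AR(1) design, the parameter values and the decision to drop the first observation (rather than pad the lagged residual) are assumptions of this example, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
T, rho = 200, 0.5
x = rng.normal(size=T)
eps = rng.normal(size=T)

# Hypothetical AR(1) disturbances, started at zero
v = np.zeros(T)
for t in range(1, T):
    v[t] = rho * v[t - 1] + eps[t]
y = 1.0 + 2.0 * x + v

# Restricted estimates under the null of no serial correlation: OLS
X = np.column_stack([np.ones(T), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat = y - X @ beta_hat

# GNR: regress u_hat_t on X_t and u_hat_{t-1}, conditioning on the first observation
W = np.column_stack([X[1:], u_hat[:-1]])
dep = u_hat[1:]
coef, *_ = np.linalg.lstsq(W, dep, rcond=None)
resid = dep - W @ coef
s2 = resid @ resid / (W.shape[0] - W.shape[1])
cov = s2 * np.linalg.inv(W.T @ W)

c_hat = coef[-1]
t_c = c_hat / np.sqrt(cov[-1, -1])   # t-statistic for c = 0
print("c_hat:", c_hat)
print("t-statistic:", t_c)
```

With serially correlated disturbances built into the simulated data, the t-statistic on the lagged residual is large and the null of no first-order serial correlation is rejected.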


8.5  Testing  Linear  Versus  Log-Linear  Functional  Form5 


In many economic applications where the explanatory variables take only positive values, econometricians
must decide whether a linear or log-linear regression model is appropriate. In general,
the linear model is given by

$y_i = \sum_{j=1}^{k} \beta_j X_{ij} + \sum_{s=1}^{\ell} \gamma_s Z_{is} + u_i$    $i = 1, 2, \ldots, n$  (8.84)

and the log-linear model is

$\log y_i = \sum_{j=1}^{k} \beta_j \log X_{ij} + \sum_{s=1}^{\ell} \gamma_s Z_{is} + u_i$    $i = 1, 2, \ldots, n$  (8.85)

with $u_i \sim \text{NID}(0, \sigma^2)$. Note that the log-linear model is general in that only the dependent
variable $y$ and a subset of the regressors, i.e., the $X$ variables, are subject to the logarithmic
transformation. Of course, one could estimate both models and compare their log-likelihood
values. This would tell us which model fits best, but not whether either is a valid specification.

Box  and  Cox  (1964)  suggested  the  following  transformation 


$B(y_i, \lambda) = \begin{cases} (y_i^{\lambda} - 1)/\lambda & \text{when } \lambda \neq 0 \\ \log y_i & \text{when } \lambda = 0 \end{cases}$  (8.86)

where $y_i > 0$. Note that for $\lambda = 1$, as long as there is a constant in the regression, subjecting the
linear model to a Box-Cox transformation is equivalent to not transforming at all, whereas for
$\lambda = 0$ the Box-Cox transformation yields the log-linear regression. Therefore, the following
Box-Cox model

$B(y_i, \lambda) = \sum_{j=1}^{k} \beta_j B(X_{ij}, \lambda) + \sum_{s=1}^{\ell} \gamma_s Z_{is} + u_i$  (8.87)

encompasses as special cases the linear and log-linear models given in (8.84) and (8.85), respectively.
Box and Cox (1964) suggested estimating these models by ML and using the LR test to
test (8.84) and (8.85) against (8.87). However, estimation of (8.87) is computationally burdensome,
see Davidson and MacKinnon (1993). Instead, we give an LM test involving a Double
Length Regression (DLR) due to Davidson and MacKinnon (1985) that is easier to compute.
In  fact,  Davidson  and  MacKinnon  (1993,  p.  510)  point  out  that  “everything  that  one  can  do 
with  the  Gauss-Newton  Regression  for  nonlinear  regression  models  can  be  done  with  the  DLR 
for  models  involving  transformations  of  the  dependent  variable.”  The  GNR  is  not  applicable  in 
cases  where  the  dependent  variable  is  subjected  to  a  nonlinear  transformation,  so  one  should 
use  a  DLR  in  these  cases.  Conversely,  in  cases  where  the  GNR  is  valid,  there  is  no  need  to  run 
the  DLR,  since  in  these  cases  the  latter  is  equivalent  to  the  GNR. 
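
The two special cases of the Box-Cox transformation in (8.86) are easy to see numerically. The short Python sketch below is an illustration only; the function name `box_cox` and the tolerance used to detect $\lambda = 0$ are assumptions of the example.

```python
import math


def box_cox(y, lam, tol=1e-12):
    """Box-Cox transformation B(y, lambda) of (8.86) for a positive scalar y."""
    if y <= 0:
        raise ValueError("the Box-Cox transformation requires y > 0")
    if abs(lam) < tol:
        return math.log(y)            # the lambda = 0 branch
    return (y**lam - 1.0) / lam       # the lambda != 0 branch


y = 3.0
print(box_cox(y, 1.0))     # lambda = 1 gives y - 1, a shift absorbed by the constant
print(box_cox(y, 0.0))     # lambda = 0 gives log y
print(box_cox(y, 1e-6))    # small lambda is close to log y
```

At $\lambda = 1$ the transformation only subtracts one from the variable, which is why a constant in the regression makes it equivalent to the linear model.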

For the linear model (8.84), the null hypothesis is that $\lambda = 1$. In this case, Davidson and
MacKinnon suggest running a regression with $2n$ observations where the dependent variable
has observations $(e_1/\hat{\sigma}, \ldots, e_n/\hat{\sigma}, 1, \ldots, 1)'$, i.e., the first $n$ observations are the OLS residuals
from (8.84) divided by the MLE of $\sigma$, where $\hat{\sigma}^2_{mle} = e'e/n$. The second $n$ observations are all
equal to 1. The $2n$ observations for the regressors have typical elements:

for $\beta_j$:  $X_{ij} - 1$  for $i = 1, \ldots, n$  and  0  for the second $n$ elements

for $\gamma_s$:  $Z_{is}$  for $i = 1, \ldots, n$  and  0  for the second $n$ elements

for $\sigma$:  $e_i/\hat{\sigma}$  for $i = 1, \ldots, n$  and  $-1$  for the second $n$ elements

for $\lambda$:  $\sum_{j=1}^{k} \hat{\beta}_j(X_{ij}\log X_{ij} - X_{ij} + 1) - (y_i \log y_i - y_i + 1)$  for $i = 1, \ldots, n$
    and  $\hat{\sigma}\log y_i$  for the second $n$ elements

The explained sum of squares for this DLR provides an asymptotically valid test for $\lambda = 1$.
This will be distributed as $\chi^2_1$ under the null hypothesis.

Similarly, when testing the log-linear model (8.85), the null hypothesis is that $\lambda = 0$. In this
case, the dependent variable of the DLR has observations $(e_1/\hat{\sigma}, e_2/\hat{\sigma}, \ldots, e_n/\hat{\sigma}, 1, \ldots, 1)'$, i.e.,
the first $n$ observations are the OLS residuals from (8.85) divided by the MLE for $\sigma$, i.e., $\hat{\sigma}$
where $\hat{\sigma}^2 = e'e/n$. The second $n$ observations are all equal to 1. The $2n$ observations for the
regressors have typical elements:

for $\beta_j$:  $\log X_{ij}$  for $i = 1, \ldots, n$  and  0  for the second $n$ elements

for $\gamma_s$:  $Z_{is}$  for $i = 1, \ldots, n$  and  0  for the second $n$ elements

for $\sigma$:  $e_i/\hat{\sigma}$  for $i = 1, \ldots, n$  and  $-1$  for the second $n$ elements

for $\lambda$:  $\frac{1}{2}\sum_{j=1}^{k} \hat{\beta}_j(\log X_{ij})^2 - \frac{1}{2}(\log y_i)^2$  for $i = 1, \ldots, n$
    and  $\hat{\sigma}\log y_i$  for the second $n$ elements

The explained sum of squares from this DLR provides an asymptotically valid test for $\lambda = 0$.
This will be distributed as $\chi^2_1$ under the null hypothesis.
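
The construction of the double length regression for the linear null ($\lambda = 1$) can be sketched in Python by stacking the regressor elements listed above. The function name `dlr_ess_linear_null`, its `include_lambda` switch and the simulated data are hypothetical, introduced only for this illustration; the sketch assumes $Z$ contains a constant.

```python
import numpy as np


def dlr_ess_linear_null(y, X, Z, include_lambda=True):
    """Explained sum of squares of the DLR testing lambda = 1.

    X holds the positive regressors subject to the transformation,
    Z the untransformed regressors (assumed to include a constant).
    """
    n, k = X.shape
    n_z = Z.shape[1]
    W = np.column_stack([X, Z])
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    e = y - W @ coef                     # OLS residuals from the linear model
    sigma = np.sqrt(e @ e / n)           # MLE of sigma
    beta = coef[:k]

    # First n rows and second n rows of each regressor block
    top = [X - 1.0, Z, (e / sigma)[:, None]]
    bottom = [np.zeros((n, k)), np.zeros((n, n_z)), -np.ones((n, 1))]
    if include_lambda:
        lam_top = (X * np.log(X) - X + 1.0) @ beta - (y * np.log(y) - y + 1.0)
        top.append(lam_top[:, None])
        bottom.append((sigma * np.log(y))[:, None])

    R = np.vstack([np.column_stack(top), np.column_stack(bottom)])
    regressand = np.concatenate([e / sigma, np.ones(n)])
    b, *_ = np.linalg.lstsq(R, regressand, rcond=None)
    fitted = R @ b
    return fitted @ fitted               # uncentered explained sum of squares


rng = np.random.default_rng(6)
n = 100
x = rng.uniform(1.0, 5.0, n)
y = 2.0 + 1.5 * x + rng.normal(0.0, 0.2, n)   # simulated data, linear model true
X = x[:, None]
Z = np.ones((n, 1))

ess = dlr_ess_linear_null(y, X, Z)
ess_without_lambda = dlr_ess_linear_null(y, X, Z, include_lambda=False)
print("DLR explained sum of squares:", ess)
print("ESS without the lambda regressor:", ess_without_lambda)
```

The second call is a sanity check: without the $\lambda$ regressor, the regressand is orthogonal to every column of the DLR, so the explained sum of squares is zero. The first value would be compared with the $\chi^2_1$ critical value.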

For the cigarette data given in Table 3.2, the linear model is given by $C = \beta_0 + \beta_1 P + \beta_2 Y + u$
whereas the log-linear model is given by $\log C = \gamma_0 + \gamma_1 \log P + \gamma_2 \log Y + \epsilon$ and the Box-Cox
model is given by $B(C, \lambda) = \delta_0 + \delta_1 B(P, \lambda) + \delta_2 B(Y, \lambda) + \nu$, where $B(C, \lambda)$ is defined in (8.86).
In this case, the DLR which tests the hypothesis that $H_0$; $\lambda = 1$, i.e., the model is linear, gives
an explained sum of squares equal to 15.55. This is greater than $\chi^2_{1,0.05} = 3.84$ and is therefore
significant at the 5% level. Similarly, the DLR that tests the hypothesis that $\lambda = 0$, i.e.,
the model is log-linear, gives an explained sum of squares equal to 8.86. This is also greater
than $\chi^2_{1,0.05} = 3.84$ and is therefore significant at the 5% level. In this case, both the linear and
log-linear models are rejected by the data.
Finally, it is important to note that there are numerous other tests for testing linear and
log-linear models and the interested reader should refer to Davidson and MacKinnon (1993).


Notes 

1.  This  section  is  based  on  Belsley,  Kuh  and  Welsch  (1980). 

2.  Other  residuals  that  are  linear  unbiased  with  a  scalar  covariance  matrix  (LUS)  are  the  BLUS 
residuals  suggested  by  Theil  (1971).  Since  we  are  explicitly  dealing  with  time-series  data,  we  use 
subscript  t  rather  than  i  to  index  observations  and  T  rather  than  n  to  denote  the  sample  size. 

3.  Ramsey’s  (1969)  initial  formulation  was  based  on  BLUS  residuals,  but  Ramsey  and  Schmidt  (1976) 
showed  that  this  is  equivalent  to  using  OLS  residuals. 

4.  This  section  is  based  on  Davidson  and  MacKinnon  (1993,  2001). 

5.  This  section  is  based  on  Davidson  and  MacKinnon  (1993,  pp.  502-510). 

Problems 

1. We know that $H = P_X$ is idempotent. Also, $(I_n - P_X)$ is idempotent. Therefore, $b'Hb \geq 0$ for any
arbitrary vector $b$. Using these facts, show for $b' = (1, 0, \ldots, 0)$ that $0 \leq h_{11} \leq 1$. Deduce that
$0 \leq h_{ii} \leq 1$ for $i = 1, \ldots, n$.

2. For the simple regression with no constant $y_i = x_i\beta + u_i$ for $i = 1, \ldots, n$

(a) What is $h_{ii}$? Verify that $\sum_{i=1}^{n} h_{ii} = 1$.

(b) What is $\hat{\beta} - \hat{\beta}_{(i)}$, see (8.13)? What is $s^2_{(i)}$ in terms of $s^2$ and $e_i^2$, see (8.18)? What is
DFBETAS$_{ij}$, see (8.19)?

(c) What are DFFIT$_i$ and DFFITS$_i$, see (8.21) and (8.22)?

(d) What is Cook's distance measure $D_i^2(s)$ for this simple regression with no intercept, see
(8.24)?

(e) Verify that (8.27) holds for this simple regression with no intercept. What is COVRATIO$_i$,
see (8.26)?

3. From the definition of $s^2_{(i)}$ in (8.17), substitute (8.13) in (8.17) and verify (8.18).

4. Consider the augmented regression given in (8.5) $y = X\beta^* + d_i\varphi + u$ where $\varphi$ is a scalar and $d_i = 1$
for the $i$-th observation and 0 otherwise. Using the Frisch-Waugh-Lovell Theorem given in section
7.3, verify that

(a) $\hat{\beta}^* = (X_{(i)}'X_{(i)})^{-1}X_{(i)}'y_{(i)} = \hat{\beta}_{(i)}$.

(b) $\hat{\varphi} = (d_i'\bar{P}_X d_i)^{-1}d_i'\bar{P}_X y = e_i/(1 - h_{ii})$ where $\bar{P}_X = I - P_X$.

(c) Residual Sum of Squares from (8.5) = (Residual Sum of Squares with $d_i$ deleted) $- e_i^2/(1 - h_{ii})$.

(d) Assuming Normality of $u$, show that the t-statistic for testing $\varphi = 0$ is $t = \hat{\varphi}/\text{s.e.}(\hat{\varphi}) = e_i^*$ as
given in (8.3).
5. Consider the augmented regression $y = X\beta^* + \bar{P}_X D_p\varphi^* + u$, where $D_p$ is an $n \times p$ matrix of
dummy variables for the $p$ suspected observations. Note that $\bar{P}_X D_p$ rather than $D_p$ appears in this
equation. Compare with (8.6). Let $e_p = D_p'e$, then $E(e_p) = 0$, $\text{var}(e_p) = \sigma^2 D_p'\bar{P}_X D_p$. Verify that

(a) $\hat{\beta}^* = (X'X)^{-1}X'y = \hat{\beta}_{OLS}$ and

(b) $\hat{\varphi}^* = (D_p'\bar{P}_X D_p)^{-1}D_p'\bar{P}_X y = (D_p'\bar{P}_X D_p)^{-1}D_p'e = (D_p'\bar{P}_X D_p)^{-1}e_p$.

(c) Residual Sum of Squares = (Residual Sum of Squares with $D_p$ deleted) $- e_p'(D_p'\bar{P}_X D_p)^{-1}e_p$.
Using the Frisch-Waugh-Lovell Theorem show this residual sum of squares is the same as
that for (8.6).

(d) Assuming normality of $u$, verify (8.7) and (8.9).

(e) Repeat this exercise for problem 4 with $\bar{P}_X d_i$ replacing $d_i$. What do you conclude?

6.  Using  the  updating  formula  in  (8.11),  verify  (8.12)  and  deduce  (8.13). 

7. Verify that Cook's distance measure given in (8.25) is related to DFFITS$_i(\sigma)$ as follows:
DFFITS$_i(\sigma) = \sqrt{k}\,D_i(\sigma)$.

8. Using the matrix identity $\det(I_k - ab') = 1 - b'a$, where $a$ and $b$ are column vectors of dimension
$k$, prove (8.27). Hint: Use $a = x_i$ and $b' = x_i'(X'X)^{-1}$ and the fact that $\det[X_{(i)}'X_{(i)}] = \det[\{I_k -
x_i x_i'(X'X)^{-1}\}X'X]$.

9.  For  the  cigarette  data  given  in  Table  3.2 

(a)  Replicate  the  results  in  Table  8.2. 

(b) For the New Hampshire observation (NH), compute $\tilde{e}_{NH}$, $e^*_{NH}$, $\hat{\beta} - \hat{\beta}_{(NH)}$, DFBETAS$_{NH}$,
DFFIT$_{NH}$, DFFITS$_{NH}$, $D^2_{NH}(s)$, COVRATIO$_{NH}$, and FVARATIO$_{NH}$.

(c)  Repeat  the  calculations  in  part  (b)  for  the  following  states:  AR,  CT,  NJ  and  UT. 

(d)  What  about  the  observations  for  NV,  ME,  NM  and  ND?  Are  they  influential? 

10.  For  the  Consumption-Income  data  given  in  Table  5.3,  compute 

(a) The internally studentized residuals $\tilde{e}$ given in (8.1).

(b)  The  externally  studentized  residuals  e*  given  in  (8.3). 

(c)  Cook’s  statistic  given  in  (8.25). 

(d)  The  leverage  of  each  observation  h. 

(e)  The  DFFITS  given  in  (8.22). 

(f)  The  COVRATIO  given  in  (8.28). 

(g)  Based  on  the  results  in  parts  (a)  to  (f),  identify  the  observations  that  are  influential. 

11.  Repeat  problem  10  for  the  1982  data  on  earnings  used  in  Chapter  4.  This  data  is  provided  on  the 
Springer  web  site  as  EARN.ASC. 

12.  Repeat  problem  10  for  the  Gasoline  data  provided  on  the  Springer  web  site  as  GASOLINE.DAT. 
Use  the  gasoline  demand  model  given  in  Chapter  10,  section  5.  Do  this  for  Austria  and  Belgium 
separately. 

13.  Independence  of  Recursive  Residuals. 

(a) Using the updating formula given in (8.11) with $A = (X_t'X_t)$ and $a = -b = x'_{t+1}$, verify
(8.31).

(b) Using (8.31), verify (8.32).

(c) For $u_t \sim \text{IIN}(0, \sigma^2)$ and $w_{t+1}$ defined in (8.30) verify (8.33). Hint: define $v_{t+1} = \sqrt{f_{t+1}}\,w_{t+1}$.
From (8.30), we have

$v_{t+1} = \sqrt{f_{t+1}}\,w_{t+1} = y_{t+1} - x'_{t+1}\hat{\beta}_t = x'_{t+1}(\beta - \hat{\beta}_t) + u_{t+1}$  for $t = k, \ldots, T - 1$

Since $f_{t+1}$ is fixed, it suffices to show that $\text{cov}(v_{t+1}, v_{s+1}) = 0$ for $t \neq s$.

14.  Recursive  Residuals  are  Linear  Unbiased  With  Scalar  Covariance  Matrix  (LUS). 

(a)  Verify  that  the  ( T  —  k )  recursive  residuals  defined  in  (8.30)  can  be  written  in  vector  form  as 
w  =  Cy  where  C  is  defined  in  (8.34).  This  shows  that  the  recursive  residuals  are  linear  in  y. 

(b) Show that $C$ satisfies the three properties given in (8.35), i.e., $CX = 0$, $CC' = I_{T-k}$, and
$C'C = \bar{P}_X$. Prove that $CX = 0$ means that the recursive residuals are unbiased with zero
mean. Prove that $CC' = I_{T-k}$ means that the recursive residuals have a scalar covariance
matrix. Prove that $C'C = \bar{P}_X$ means that the sum of squares of the $(T - k)$ recursive residuals
is equal to the sum of squares of the $T$ least squares residuals.

(c) If the true disturbances $u \sim N(0, \sigma^2 I_T)$, prove that the recursive residuals $w \sim N(0, \sigma^2 I_{T-k})$
using parts (a) and (b).

(d) Verify (8.36), i.e., show that $RSS_{t+1} = RSS_t + w^2_{t+1}$ for $t = k, \ldots, T - 1$ where $RSS_t =
(Y_t - X_t\hat{\beta}_t)'(Y_t - X_t\hat{\beta}_t)$.

15.  The  Harvey  and  Collier  (1977)  Misspecification  t-Test  as  a  Variable  Additions  Test.  This  is  based 
on  Wu  (1993). 

(a) Show that the F-statistic for testing $H_0$; $\gamma = 0$ versus $\gamma \neq 0$ in (8.44) is given by

$F = \frac{y'\bar{P}_X y - y'\bar{P}_{[X,z]}y}{y'\bar{P}_{[X,z]}y/(T - k - 1)} = \frac{y'P_z y}{y'(\bar{P}_X - P_z)y/(T - k - 1)}$

and is distributed as $F(1, T - k - 1)$ under the null hypothesis.

(b) Using the properties of $C$ given in (8.35), show that the F-statistic given in part (a) is the
square of the Harvey and Collier (1977) t-statistic given in (8.43).

16.  For  the  Gasoline  data  for  Austria  given  on  the  Springer  web  site  as  GASOLINE.DAT  and  the 
model  given  in  Chapter  10,  section  5,  compute: 

(a)  The  recursive  residuals  given  in  (8.30). 

(b)  The  CUSUM  given  in  (8.46)  and  plot  it  against  r. 

(c)  Draw  the  5%  upper  and  lower  lines  given  below  (8.46)  and  see  whether  the  CUSUM  crosses 
these  boundaries. 

(d)  The  post-sample  predictive  test  for  1978.  Verify  that  computing  it  from  (8.38)  or  (8.40)  yields 
the  same  answer. 

(e) The modified von Neumann ratio given in (8.42).

(f)  The  Harvey  and  Collier  (1977)  functional  misspecification  test  given  in  (8.43). 

17.  The  Differencing  Test  in  a  Regression  with  Equicorrelated  Disturbances.  This  is  based  on  Baltagi 
(1990).  Consider  the  time-series  regression 


$Y = \iota_T\alpha + X\beta + u$  (1)

where $\iota_T$ is a vector of ones of dimension $T$. $X$ is $T \times K$ and $[\iota_T, X]$ is of full column rank. $u \sim (0, \Omega)$
where $\Omega$ is positive definite. Differencing this model, we get

$DY = DX\beta + Du$  (2)

where $D$ is a $(T - 1) \times T$ matrix given below (8.50). Maeshiro and Wichers (1989) show that GLS
on (1) yields through partitioned inverse:

$\hat{\beta} = (X'LX)^{-1}X'LY$  (3)

where $L = \Omega^{-1} - \Omega^{-1}\iota_T(\iota_T'\Omega^{-1}\iota_T)^{-1}\iota_T'\Omega^{-1}$. Also, GLS on (2) yields

$\tilde{\beta} = (X'MX)^{-1}X'MY$  (4)

where $M = D'(D\Omega D')^{-1}D$. Finally, they show that $M = L$, and GLS on (2) is equivalent to GLS
on (1) as long as there is an intercept in (1).

Consider the special case of equicorrelated disturbances

$\Omega = \sigma^2[(1 - \rho)I_T + \rho J_T]$  (5)

where $I_T$ is an identity matrix of dimension $T$ and $J_T$ is a matrix of ones of dimension $T$.

(a)  Derive  the  L  and  M  matrices  for  the  equicorrelated  case,  and  verify  the  Maeshiro  and  Wichers 
result  for  this  special  case. 

(b)  Show  that  for  the  equicorrelated  case,  the  differencing  test  given  by  Plosser,  Schwert,  and 
White  (1982)  can  be  obtained  as  the  difference  between  the  OLS  and  GLS  estimators  of  the 
differenced  equation  (2).  Hint:  See  the  solution  by  Koning  (1992). 

18.  For  the  1982  data  on  earnings  used  in  Chapter  4,  provided  as  EARN.ASC  on  the  Springer  web 
site,  (a)  compute  Ramsey’s  (1969)  RESET,  (b)  Compute  White’s  (1982)  information  matrix  test 
given  in  (8.69)  and  (8.70). 

19.  Repeat  problem  18  for  the  Hedonic  housing  data  given  on  the  Springer  web  site  as  HEDONIC.XLS. 

20.  Repeat  problem  18  for  the  cigarette  data  given  in  Table  3.2. 

21. Repeat problem 18 for the Gasoline data for Austria given on the Springer web site as
GASOLINE.DAT. Use the model given in Chapter 10, section 5. Also compute the PSW differencing
test given in (8.54).

22.  Use  the  1982  data  on  earnings  used  in  Chapter  4,  and  provided  on  the  Springer  web  site  as 
EARN.ASC.  Consider  the  two  competing  non-nested  models 

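$H_0$; $\log(wage) = \beta_0 + \beta_1 ED + \beta_2 EXP + \beta_3 EXP^2 + \beta_4 WKS$
        $+ \beta_5 MS + \beta_6 FEM + \beta_7 BLK + \beta_8 UNION + u$

$H_1$; $\log(wage) = \gamma_0 + \gamma_1 ED + \gamma_2 EXP + \gamma_3 EXP^2 + \gamma_4 WKS$
        $+ \gamma_5 OCC + \gamma_6 SOUTH + \gamma_7 SMSA + \gamma_8 IND + \epsilon$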


Compute: 

(a) The Davidson and MacKinnon (1981) J-test for $H_0$ versus $H_1$.

(b) The Fisher and McAleer (1981) JA-test for $H_0$ versus $H_1$.
(c) Reverse the roles of $H_0$ and $H_1$ and repeat parts (a) and (b).

(d) Both $H_0$ and $H_1$ can be artificially nested in the model used in Chapter 4. Using the
F-test given in (8.62), test for $H_0$ versus this augmented model. Repeat for $H_1$ versus this
augmented model. What do you conclude?

23.  For  the  Consumption-Income  data  given  in  Table  5.3, 

(a) Test the hypothesis that the Consumption model is linear against a general Box-Cox
alternative.

(b)  Test  the  hypothesis  that  the  Consumption  model  is  log-linear  against  a  general  Box-Cox 
alternative. 

24.  Repeat  problem  23  for  the  Cigarette  data  given  in  Table  3.2. 

25.  RESET  as  a  Gauss-Newton  Regression.  This  is  based  on  Baltagi  (1998).  Davidson  and  MacKinnon 
(1993)  showed  that  Ramsey’s  (1969)  regression  error  specification  test  (RESET)  can  be  derived 
as  a  Gauss-Newton  Regression.  This  problem  is  a  simple  extension  of  their  results.  Suppose  that 
the  linear  regression  model  under  test  is  given  by: 

$y_t = X_t'\beta + u_t$    $t = 1, 2, \ldots, T$  (1)

where $\beta$ is a $k \times 1$ vector of unknown parameters. Suppose that the alternative is the nonlinear
regression model between $y_t$ and $X_t$:

$y_t = X_t'\beta[1 + \theta(X_t'\beta) + \gamma(X_t'\beta)^2 + \lambda(X_t'\beta)^3] + u_t$  (2)

where $\theta$, $\gamma$, and $\lambda$ are unknown scalar parameters. It is well known that Ramsey's (1969) RESET
is obtained by regressing $y_t$ on $X_t$, $\hat{y}_t^2$, $\hat{y}_t^3$ and $\hat{y}_t^4$ and by testing that the coefficients of all powers
of $\hat{y}_t$ are jointly zero. Show that this RESET can be derived from a Gauss-Newton Regression on
(2), which tests $\theta = \gamma = \lambda = 0$.
CHAPTER 9

Generalized  Least  Squares 

9.1  Introduction 

This chapter considers a more general variance-covariance matrix for the disturbances. In other
words, $u \sim (0, \sigma^2 I_n)$ is relaxed so that $u \sim (0, \sigma^2\Omega)$ where $\Omega$ is a positive definite matrix of
dimension $(n \times n)$. First $\Omega$ is assumed known and the BLUE for $\beta$ is derived. This estimator turns
out to be different from $\hat{\beta}_{OLS}$, and is denoted by $\hat{\beta}_{GLS}$, the Generalized Least Squares estimator
of $\beta$. Next, we study the properties of $\hat{\beta}_{OLS}$ under this nonspherical form of the disturbances. It
turns out that the OLS estimates are still unbiased and consistent, but their standard errors as
computed by standard regression packages are biased and inconsistent and lead to misleading
inference. Section 9.3 studies some special forms of $\Omega$ and derives the corresponding BLUE for $\beta$.
It turns out that heteroskedasticity and serial correlation studied in Chapter 5 are special cases
of $\Omega$. Section 9.4 introduces normality and derives the maximum likelihood estimator. Sections
9.5 and 9.6 study the way in which tests of hypotheses and prediction are affected by this general
variance-covariance assumption on the disturbances. Section 9.7 studies the properties of this
BLUE for $\beta$ when $\Omega$ is unknown, and is replaced by a consistent estimator. Section 9.8 studies
what happens to the W, LR and LM statistics when $u \sim N(0, \sigma^2\Omega)$. Section 9.9 gives another
application of GLS to spatial autocorrelation.


9.2  Generalized  Least  Squares 

The regression equation did not change, only the variance-covariance matrix of the disturbances.
It is now $\sigma^2\Omega$ rather than $\sigma^2 I_n$. However, we can rely once again on a result from matrix algebra
to transform our nonspherical disturbances back to spherical form, see the Appendix to Chapter
7. This result states that for every positive definite matrix $\Omega$, there exists a nonsingular matrix
$P$ such that $PP' = \Omega$. In order to use this result, we transform the original model


$y = X\beta + u$  (9.1)

by premultiplying it by $P^{-1}$. We get

$P^{-1}y = P^{-1}X\beta + P^{-1}u$  (9.2)

Defining $y^*$ as $P^{-1}y$ and $X^*$ and $u^*$ similarly, we have

$y^* = X^*\beta + u^*$  (9.3)

with $u^*$ having 0 mean and $\text{var}(u^*) = P^{-1}\text{var}(u)P^{-1\prime} = \sigma^2 P^{-1}\Omega P'^{-1} = \sigma^2 P^{-1}PP'P'^{-1} = \sigma^2 I_n$.

Hence, the variance-covariance of the disturbances in (9.3) is a scalar times an identity matrix.
Therefore, using the results of Chapter 7, the BLUE for $\beta$ in (9.1) is OLS on the transformed
model in (9.3)

$\hat{\beta}_{BLUE} = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'P^{-1\prime}P^{-1}X)^{-1}X'P^{-1\prime}P^{-1}y = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$  (9.4)
with $\text{var}(\hat{\beta}_{BLUE}) = \sigma^2(X^{*\prime}X^*)^{-1} = \sigma^2(X'\Omega^{-1}X)^{-1}$. This $\hat{\beta}_{BLUE}$ is known as $\hat{\beta}_{GLS}$. Define
$\Sigma = E(uu') = \sigma^2\Omega$, then $\Sigma$ differs from $\Omega$ only by the positive scalar $\sigma^2$. One can easily verify
that $\hat{\beta}_{GLS}$ can be alternatively written as $\hat{\beta}_{GLS} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y$ and that $\text{var}(\hat{\beta}_{GLS}) =
(X'\Sigma^{-1}X)^{-1}$. Just substitute $\Sigma^{-1} = \Omega^{-1}/\sigma^2$ in this last expression for $\hat{\beta}_{GLS}$ and verify that
this yields (9.4).
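
The equivalence between OLS on the transformed model (9.3) and the direct formula in (9.4) can be checked numerically. In the Python sketch below, the AR(1)-type $\Omega$, the use of the Cholesky factor as one possible choice of $P$, and the simulated data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, rho = 50, 3, 0.6
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])

# Illustrative positive definite Omega with (i, j) element rho^|i - j|
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

# One nonsingular P with P P' = Omega: the lower triangular Cholesky factor
P = np.linalg.cholesky(Omega)
y = X @ np.array([1.0, -2.0, 0.5]) + P @ rng.normal(size=n)

# OLS on the transformed model y* = X* beta + u*, with y* = P^{-1} y and X* = P^{-1} X
y_star = np.linalg.solve(P, y)
X_star = np.linalg.solve(P, X)
beta_transformed, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# Direct GLS formula (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
Omega_inv = np.linalg.inv(Omega)
beta_direct = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

print("OLS on transformed model:", beta_transformed)
print("Direct GLS formula:      ", beta_direct)
```

Any nonsingular $P$ with $PP' = \Omega$ gives the same estimator; the Cholesky factor is simply a convenient one to compute.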

It is clear that $\hat{\beta}_{GLS}$ differs from $\hat{\beta}_{OLS}$. In fact, since $\hat{\beta}_{OLS}$ is still a linear unbiased estimator
of $\beta$, the Gauss-Markov Theorem states that it must have a variance larger than that of $\hat{\beta}_{GLS}$.
Using equation (7.5) from Chapter 7, i.e., $\hat{\beta}_{OLS} = \beta + (X'X)^{-1}X'u$, it is easy to show that

$\text{var}(\hat{\beta}_{OLS}) = \sigma^2(X'X)^{-1}(X'\Omega X)(X'X)^{-1}$  (9.5)

Problem 1 shows that $\text{var}(\hat{\beta}_{OLS}) - \text{var}(\hat{\beta}_{GLS})$ is a positive semi-definite matrix. Note that
$\text{var}(\hat{\beta}_{OLS})$ is no longer $\sigma^2(X'X)^{-1}$, and hence a regression package that is programmed to
compute $s^2(X'X)^{-1}$ as an estimate of the variance of $\hat{\beta}_{OLS}$ is using the wrong formula. Furthermore,
problem 2 shows that $E(s^2)$ is not in general $\sigma^2$. Hence, the regression package is also
wrongly estimating $\sigma^2$ by $s^2$. Two wrongs do not make a right, and the estimate of $\text{var}(\hat{\beta}_{OLS})$
is biased. The direction of this bias depends upon the form of $\Omega$ and the $X$ matrix. (We saw
some examples of this bias under heteroskedasticity and serial correlation in Chapter 5.) Hence,
the standard errors and t-statistics computed using this OLS regression are biased. Under
heteroskedasticity, one can use the White (1980) robust standard errors for OLS. In this case,
$\Sigma = \sigma^2\Omega$ in (9.5) is estimated by $\hat{\Sigma} = \text{diag}[e_i^2]$ where $e_i$ denotes the least squares residuals.
The resulting t-statistics are robust to heteroskedasticity. Similarly, Wald type statistics for
$H_0$; $R\beta = r$ can be obtained based on $\hat{\beta}_{OLS}$ by replacing $\sigma^2(X'X)^{-1}$ in (7.41) by (9.5) with
$\hat{\Sigma} = \text{diag}[e_i^2]$. In the presence of both serial correlation and heteroskedasticity, one can use the
consistent covariance matrix estimate suggested by Newey and West (1987). This was discussed
in Chapter 5.
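
The variance formula (9.5), its comparison with the GLS variance, and the White (1980) estimate can be illustrated numerically. The heteroskedastic design below, with $\text{var}(u_i) = \sigma^2 x_i^2$, and all variable names are assumptions of this sketch rather than an example from the text.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2 = 200, 1.0
x = rng.uniform(1.0, 5.0, n)
X = np.column_stack([np.ones(n), x])

# Hypothetical heteroskedasticity: Omega = diag(x_i^2)
omega_diag = x**2
Omega = np.diag(omega_diag)

XtX_inv = np.linalg.inv(X.T @ X)
# True variance of OLS from (9.5) and true variance of GLS
var_ols = sigma2 * XtX_inv @ (X.T @ Omega @ X) @ XtX_inv
var_gls = sigma2 * np.linalg.inv(X.T @ np.diag(1.0 / omega_diag) @ X)

# var(OLS) - var(GLS) should be positive semi-definite
diff = var_ols - var_gls
min_eig = np.linalg.eigvalsh((diff + diff.T) / 2.0).min()

# One simulated sample: White's estimate replaces sigma^2 Omega by diag(e_i^2)
u = np.sqrt(sigma2 * omega_diag) * rng.normal(size=n)
y = X @ np.array([1.0, 2.0]) + u
beta_ols = XtX_inv @ X.T @ y
e = y - X @ beta_ols
var_white = XtX_inv @ (X.T @ (e[:, None] ** 2 * X)) @ XtX_inv
var_naive = (e @ e / (n - X.shape[1])) * XtX_inv   # the s^2 (X'X)^{-1} formula

print("smallest eigenvalue of var(OLS) - var(GLS):", min_eig)
print("White standard errors:", np.sqrt(np.diag(var_white)))
print("naive standard errors:", np.sqrt(np.diag(var_naive)))
```

The White and naive standard errors generally differ in a sample like this one; how large the gap is depends on the form of $\Omega$ and on $X$, as noted above.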

To summarize, $\hat{\beta}_{OLS}$ is no longer BLUE whenever $\Omega \neq I_n$. However, it is still unbiased and
consistent. The last two properties do not rely upon the form of the variance-covariance matrix
of the disturbances but rather on $E(u/X) = 0$ and plim $X'u/n = 0$. The standard errors of
$\hat{\beta}_{OLS}$ as computed by the regression package are biased and any test of hypothesis based on
this OLS regression may be misleading.

So far we have not derived an estimator for $\sigma^2$. We know however, from the results in Chapter
7, that the transformed regression (9.3) yields a mean squared error that is an unbiased estimator
for $\sigma^2$. Denote this by $s^{*2}$, which is equal to the transformed OLS residual sum of squares
divided by $(n - K)$. Let $e^*$ denote the vector of OLS residuals from (9.3); this means that
$e^* = y^* - X^*\hat{\beta}_{GLS} = P^{-1}(y - X\hat{\beta}_{GLS}) = P^{-1}e_{GLS}$ and

$s^{*2} = e^{*\prime}e^*/(n - K) = (y - X\hat{\beta}_{GLS})'\Omega^{-1}(y - X\hat{\beta}_{GLS})/(n - K) = e_{GLS}'\Omega^{-1}e_{GLS}/(n - K)$  (9.6)

Note that $s^{*2}$ now depends upon $\Omega^{-1}$.
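
The equivalence between GLS and OLS on the transformed model, and the two ways of writing $s^{*2}$ in (9.6), can be checked numerically. The sketch below assumes a known diagonal $\Omega$ with made-up entries (`omega_diag` is hypothetical) so that $P^{-1} = \Omega^{-1/2}$ is trivial to form.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
omega_diag = rng.uniform(0.5, 3.0, size=n)       # hypothetical known Omega = diag(omega_diag)
Omega_inv = np.diag(1.0 / omega_diag)
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n) * np.sqrt(omega_diag)

# Direct GLS formula and s*^2 = e_GLS' Omega^{-1} e_GLS / (n - K)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
e_gls = y - X @ beta_gls
s2_star_direct = e_gls @ Omega_inv @ e_gls / (n - K)

# Same quantities from OLS on the transformed regression y* = P^{-1} y, X* = P^{-1} X
P_inv = np.diag(1.0 / np.sqrt(omega_diag))
X_star, y_star = P_inv @ X, P_inv @ y
beta_star, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
e_star = y_star - X_star @ beta_star
s2_star_transformed = e_star @ e_star / (n - K)

print(beta_gls, beta_star, s2_star_direct, s2_star_transformed)
```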


Necessary  and  Sufficient  Conditions  for  OLS  to  be  Equivalent  to  GLS 

There are several necessary and sufficient conditions for OLS to be equivalent to GLS, see
Puntanen and Styan (1989) for a historical survey. For pedagogical reasons, we focus on the
derivation of Milliken and Albohali (1984). Note that $y = P_X y + \bar{P}_X y$. Therefore, replacing $y$
in $\hat{\beta}_{GLS}$ by this expression we get

$\hat{\beta}_{GLS} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}[P_X y + \bar{P}_X y] = \hat{\beta}_{OLS} + (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}\bar{P}_X y$

The last term is zero for every $y$ if and only if

$X'\Omega^{-1}\bar{P}_X = 0$  (9.7)

Therefore, $\hat{\beta}_{GLS} = \hat{\beta}_{OLS}$ if and only if (9.7) is true.

Another easy necessary and sufficient condition to check in practice is the following:

$P_X\Omega = \Omega P_X$  (9.8)

see Zyskind (1967). This involves $\Omega$ rather than $\Omega^{-1}$. There are several applications in economics
where these conditions are satisfied and can be easily verified, see Balestra (1970) and Baltagi
(1989). We will apply these conditions in Chapter 10 on Seemingly Unrelated Regressions,
Chapter 11 on simultaneous equations and Chapter 12 on panel data. See also problem 9.
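
Condition (9.8) is easy to check numerically. The sketch below is only an illustration under assumed inputs: it compares an equi-correlated $\Omega$ (the case of problem 9, with a constant in $X$) against an AR(1)-type $\Omega$, using a randomly drawn regressor. The helper names `zyskind_holds` and `ols_and_gls` are hypothetical.

```python
import numpy as np

def zyskind_holds(X, Omega):
    """Check condition (9.8): P_X Omega = Omega P_X."""
    P_X = X @ np.linalg.solve(X.T @ X, X.T)
    return bool(np.allclose(P_X @ Omega, Omega @ P_X))

def ols_and_gls(X, y, Omega):
    """Return the OLS and GLS coefficient vectors for a known Omega."""
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    Omega_inv = np.linalg.inv(Omega)
    b_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
    return b_ols, b_gls

T, rho = 6, 0.4
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(T), rng.normal(size=T)])    # includes a constant
y = rng.normal(size=T)

Omega_equi = (1 - rho) * np.eye(T) + rho * np.ones((T, T))   # equi-correlated disturbances
idx = np.arange(T)
Omega_ar1 = rho ** np.abs(idx[:, None] - idx[None, :])       # AR(1)-type correlation

for name, Omega in [("equi-correlated", Omega_equi), ("AR(1)", Omega_ar1)]:
    b_ols, b_gls = ols_and_gls(X, y, Omega)
    print(name, zyskind_holds(X, Omega), np.allclose(b_ols, b_gls))
```

With the equi-correlated matrix the condition holds and the two estimators coincide; with the AR(1) matrix and an arbitrary regressor they generally do not.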


9.3  Special Forms of $\Omega$


If the disturbances are heteroskedastic but not serially correlated, then $\Omega = \mathrm{diag}[\sigma_i^2]$. In this case,
$P = \mathrm{diag}[\sigma_i]$, $P^{-1} = \Omega^{-1/2} = \mathrm{diag}[1/\sigma_i]$ and $\Omega^{-1} = \mathrm{diag}[1/\sigma_i^2]$. Premultiplying the regression
equation by $\Omega^{-1/2}$ is equivalent to dividing the $i$-th observation of this model by $\sigma_i$. This makes
the new disturbance $u_i/\sigma_i$ have 0 mean and homoskedastic variance $\sigma^2$, leaving properties
like no serial correlation intact. The new regression runs $y_i^* = y_i/\sigma_i$ on $X_{ik}^* = X_{ik}/\sigma_i$ for
$i = 1, 2, \ldots, n$, and $k = 1, 2, \ldots, K$. Specific assumptions on the form of these $\sigma_i$'s were studied
in the heteroskedasticity chapter.

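If the disturbances follow an AR(1) process $u_t = \rho u_{t-1} + \epsilon_t$ for $t = 1, 2, \ldots, T$; with $|\rho| < 1$
and $\epsilon_t \sim$ IID$(0, \sigma_\epsilon^2)$, then cov$(u_t, u_{t-s}) = \rho^s\sigma_u^2$ with $\sigma_u^2 = \sigma_\epsilon^2/(1 - \rho^2)$. This means that

$\Omega = \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\ \rho & 1 & \rho & \cdots & \rho^{T-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1 \end{pmatrix}$  (9.9)

and

$\Omega^{-1} = \frac{1}{1 - \rho^2}\begin{pmatrix} 1 & -\rho & 0 & \cdots & 0 & 0 \\ -\rho & 1 + \rho^2 & -\rho & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 + \rho^2 & -\rho \\ 0 & 0 & 0 & \cdots & -\rho & 1 \end{pmatrix}$  (9.10)

Then

$P^{-1} = \begin{pmatrix} \sqrt{1 - \rho^2} & 0 & 0 & \cdots & 0 & 0 \\ -\rho & 1 & 0 & \cdots & 0 & 0 \\ 0 & -\rho & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & -\rho & 1 \end{pmatrix}$  (9.11)

is the matrix that satisfies the following condition $P^{-1\prime}P^{-1} = (1 - \rho^2)\Omega^{-1}$. Premultiplying
the regression model by $P^{-1}$ is equivalent to performing the Prais-Winsten transformation. In
particular the first observation on $y$ becomes $y_1^* = \sqrt{1 - \rho^2}\,y_1$ and the remaining observations are
given by $y_t^* = (y_t - \rho y_{t-1})$ for $t = 2, 3, \ldots, T$, with similar terms for the $X$'s and the disturbances.
Problem 3 shows that the variance covariance matrix of the transformed disturbances $u^* = P^{-1}u$
is $\sigma_\epsilon^2 I_T$.

Other examples where an explicit form for $P^{-1}$ has been derived include, (i) the MA(1) model,
see Balestra (1980); (ii) the AR(2) model, see Lempers and Kloek (1973); (iii) the specialized
AR(4) model for quarterly data, see Thomas and Wallis (1971); and (iv) the error components
model, see Fuller and Battese (1974) and Chapter 12.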
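
The AR(1) results above lend themselves to a quick numerical check. The following sketch builds the AR(1) correlation matrix and the Prais-Winsten transformation matrix for illustrative values of $T$ and $\rho$ (both chosen arbitrarily here) and confirms the stated identity; the function names are hypothetical.

```python
import numpy as np

def ar1_omega(T, rho):
    """AR(1) correlation matrix with (t, s) element rho^|t - s|."""
    idx = np.arange(T)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def prais_winsten_matrix(T, rho):
    """P^{-1}: first row scaled by sqrt(1 - rho^2), remaining rows quasi-difference."""
    P_inv = np.eye(T) - rho * np.eye(T, k=-1)
    P_inv[0, 0] = np.sqrt(1.0 - rho ** 2)
    return P_inv

T, rho = 5, 0.6
Omega = ar1_omega(T, rho)
P_inv = prais_winsten_matrix(T, rho)

lhs = P_inv.T @ P_inv
rhs = (1.0 - rho ** 2) * np.linalg.inv(Omega)

# var(P^{-1} u) = sigma_u^2 P^{-1} Omega P^{-1}' = sigma_eps^2 I_T, so this should be I_T:
transformed_cov = P_inv @ Omega @ P_inv.T / (1.0 - rho ** 2)
print(np.allclose(lhs, rhs), np.allclose(transformed_cov, np.eye(T)))
```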


9.4  Maximum  Likelihood  Estimation 

Assuming that $u \sim N(0, \sigma^2\Omega)$, the new likelihood function can be derived keeping in mind that
$u^* = P^{-1}u = \Omega^{-1/2}u$ and $u^* \sim N(0, \sigma^2 I_n)$. In this case

$f(u_1^*, \ldots, u_n^*; \sigma^2) = (1/2\pi\sigma^2)^{n/2}\exp\{-u^{*\prime}u^*/2\sigma^2\}$  (9.12)

Making the transformation $u = Pu^* = \Omega^{1/2}u^*$, we get

$f(u_1, \ldots, u_n; \sigma^2) = (1/2\pi\sigma^2)^{n/2}\,|\Omega^{-1/2}|\exp\{-u'\Omega^{-1}u/2\sigma^2\}$  (9.13)

where $|\Omega^{-1/2}|$ is the Jacobian of the inverse transformation. Finally, substituting $y = X\beta + u$
in (9.13), one gets the likelihood function

$L(\beta, \sigma^2; \Omega) = (1/2\pi\sigma^2)^{n/2}\,|\Omega^{-1/2}|\exp\{-(y - X\beta)'\Omega^{-1}(y - X\beta)/2\sigma^2\}$  (9.14)

since the Jacobian of this last transformation is 1. Knowing $\Omega$, maximizing (9.14) with respect to
$\beta$ is equivalent to minimizing $u^{*\prime}u^*$ with respect to $\beta$. This means that $\hat{\beta}_{MLE}$ is the OLS estimate
on the transformed model, i.e., $\hat{\beta}_{GLS}$. From (9.14), we see that this RSS is a weighted one with
the weight being the inverse of the variance covariance matrix of the disturbances. Similarly,
maximizing (9.14) with respect to $\sigma^2$ gives $\hat{\sigma}^2_{MLE}$ = the OLS residual sum of squares of the
transformed regression (9.3) divided by $n$. From (9.6) this can be written as $\hat{\sigma}^2_{MLE} = e^{*\prime}e^*/n =
(n - K)s^{*2}/n$. The distributions of these maximum likelihood estimates can be derived from
the transformed model using the results in Chapter 7. In fact, $\hat{\beta}_{GLS} \sim N(\beta, \sigma^2(X'\Omega^{-1}X)^{-1})$
and $(n - K)s^{*2}/\sigma^2 \sim \chi^2_{n-K}$.

9.5  Test  of  Hypotheses 

In order to test $H_0$; $R\beta = r$, under the general variance-covariance matrix assumption, one can
revert to the transformed model (9.3) which has a scalar identity variance-covariance matrix
and use the test statistic derived in Chapter 7

$(R\hat{\beta}_{GLS} - r)'[R(X^{*\prime}X^*)^{-1}R']^{-1}(R\hat{\beta}_{GLS} - r)/\sigma^2 \sim \chi^2_g$  (9.15)

Note that $\hat{\beta}_{GLS}$ replaces $\hat{\beta}_{OLS}$ and $X^*$ replaces $X$. Replacing $X^*$ by $P^{-1}X$, we get

$(R\hat{\beta}_{GLS} - r)'[R(X'\Omega^{-1}X)^{-1}R']^{-1}(R\hat{\beta}_{GLS} - r)/\sigma^2 \sim \chi^2_g$  (9.16)

This differs from its counterpart in the spherical disturbances model in two ways. $\hat{\beta}_{GLS}$ replaces
$\hat{\beta}_{OLS}$, and $(X'\Omega^{-1}X)$ takes the place of $X'X$. One can also derive the restricted estimator based
on the transformed model by simply replacing $X^*$ by $P^{-1}X$ and the OLS estimator of $\beta$ by its
GLS counterpart. Problem 4 asks the reader to verify that the restricted GLS estimator is

$\hat{\beta}_{RGLS} = \hat{\beta}_{GLS} - (X'\Omega^{-1}X)^{-1}R'[R(X'\Omega^{-1}X)^{-1}R']^{-1}(R\hat{\beta}_{GLS} - r)$  (9.17)

Furthermore, using the same analysis given in Chapter 7, one can show that (9.15) is in fact
the Likelihood Ratio statistic and is equal to the Wald and Lagrangian Multiplier statistics,
see Buse (1982). In order to operationalize these tests, we replace $\sigma^2$ by its unbiased estimate
$s^{*2}$, and divide by $g$, the number of restrictions. The resulting statistic is an $F(g, n - K)$ for the
same reasons given in Chapter 7.
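
A minimal sketch of the operational F-statistic and of the restricted GLS estimator in (9.17) is given below. It assumes a known diagonal $\Omega$ and a single zero restriction on the last coefficient; the data, the weights `w`, and the function names are all made up for illustration.

```python
import numpy as np

def gls_fit(X, y, Omega_inv):
    """GLS coefficients and (X' Omega^{-1} X)^{-1}."""
    XOX_inv = np.linalg.inv(X.T @ Omega_inv @ X)
    return XOX_inv @ X.T @ Omega_inv @ y, XOX_inv

def restricted_gls(X, y, Omega_inv, R, r):
    """Restricted GLS estimator as in (9.17)."""
    beta, XOX_inv = gls_fit(X, y, Omega_inv)
    A = R @ XOX_inv @ R.T
    return beta - XOX_inv @ R.T @ np.linalg.solve(A, R @ beta - r)

def gls_f_statistic(X, y, Omega_inv, R, r):
    """(9.16) with sigma^2 replaced by s*^2 and divided by g restrictions."""
    n, K = X.shape
    g = R.shape[0]
    beta, XOX_inv = gls_fit(X, y, Omega_inv)
    e = y - X @ beta
    s2_star = e @ Omega_inv @ e / (n - K)
    d = R @ beta - r
    return d @ np.linalg.solve(R @ XOX_inv @ R.T, d) / (g * s2_star)

rng = np.random.default_rng(3)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
w = rng.uniform(0.5, 2.0, size=n)                 # hypothetical known Omega = diag(w)
Omega_inv = np.diag(1.0 / w)
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n) * np.sqrt(w)
R = np.array([[0.0, 0.0, 1.0]])                   # H0: third coefficient equals zero
r = np.array([0.0])

F = gls_f_statistic(X, y, Omega_inv, R, r)
beta_r = restricted_gls(X, y, Omega_inv, R, r)
print(F, beta_r)
```

The restricted estimate satisfies $R\hat{\beta}_{RGLS} = r$ by construction, and the same F value can be obtained from the restricted and unrestricted weighted residual sums of squares.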


9.6  Prediction 

How is prediction affected by nonspherical disturbances? Suppose we want to predict one period
ahead. What has changed with a general $\Omega$? For one thing, we now know that the period $(T + 1)$
disturbance is correlated with the sample disturbances. Let us assume that this correlation is
given by the $(T \times 1)$ vector $\omega = E(u_{T+1}u)$. Problem 5 shows that the BLUP for $y_{T+1}$ is

$\hat{y}_{T+1} = x'_{T+1}\hat{\beta}_{GLS} + \omega'\Omega^{-1}(y - X\hat{\beta}_{GLS})/\sigma^2$  (9.18)

The first term is as expected, however it is the second term that highlights the difference
between the spherical and nonspherical model predictions. To illustrate this, let us look at the
AR(1) case where cov$(u_t, u_{t-s}) = \rho^s\sigma_u^2$. This implies that $\omega' = \sigma_u^2(\rho^T, \rho^{T-1}, \ldots, \rho)$. Using $\Omega$
which is given in (9.9), one can show that $\omega$ is equal to $\rho\sigma_u^2$ multiplied by the last column of
$\Omega$. But $\Omega^{-1}\Omega = I_T$, therefore, $\Omega^{-1}$ times the last column of $\Omega$ gives the last column of the
identity matrix, i.e., $(0, 0, \ldots, 1)'$. Substituting for the last column of $\Omega$ its expression $(\omega/\rho\sigma_u^2)$
one gets $\Omega^{-1}(\omega/\rho\sigma_u^2) = (0, 0, \ldots, 1)'$. Transposing and rearranging this last expression, we get
$\omega'\Omega^{-1}/\sigma_u^2 = \rho(0, 0, \ldots, 1)$. This means that the last term in (9.18) is equal to $\rho(0, 0, \ldots, 1)(y -
X\hat{\beta}_{GLS}) = \rho e_{T,GLS}$, where $e_{T,GLS}$ is the $T$-th GLS residual. This differs from the spherical
model prediction in that next year's disturbance is not independent of the sample disturbances
and hence, is not predicted by its mean which is zero. Instead, one uses the fact that $u_{T+1} =
\rho u_T + \epsilon_{T+1}$ and predicts $u_{T+1}$ by $\rho e_{T,GLS}$. Only $\epsilon_{T+1}$ is predicted by its zero mean but $u_T$ is
predicted by $e_{T,GLS}$.
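
The AR(1) argument can be verified numerically. In the sketch below, $\rho$ is treated as known and the data are simulated purely for illustration; the function name `ar1_blup_one_step` and all numerical values are assumptions. It checks that the weights $\omega'\Omega^{-1}/\sigma_u^2$ collapse to $\rho(0, 0, \ldots, 1)$, so that the correction term in (9.18) is $\rho$ times the last GLS residual.

```python
import numpy as np

def ar1_blup_one_step(X, y, x_next, rho):
    """One-step-ahead BLUP (9.18) under AR(1) disturbances with known rho."""
    T = len(y)
    idx = np.arange(T)
    Omega = rho ** np.abs(idx[:, None] - idx[None, :])
    Omega_inv = np.linalg.inv(Omega)
    beta = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
    e = y - X @ beta                                   # GLS residuals
    omega_scaled = rho ** np.arange(T, 0, -1)          # omega / sigma_u^2 = (rho^T, ..., rho)
    weights = omega_scaled @ Omega_inv                 # omega' Omega^{-1} / sigma_u^2
    prediction = x_next @ beta + weights @ e
    return prediction, beta, weights, e

rng = np.random.default_rng(6)
T, rho = 8, 0.7
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([2.0, 1.0]) + rng.normal(size=T)      # illustrative data only
x_next = np.array([1.0, 0.3])

pred, beta, weights, e = ar1_blup_one_step(X, y, x_next, rho)
print(weights)                                         # approximately (0, ..., 0, rho)
print(pred, x_next @ beta + rho * e[-1])               # the two agree
```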


9.7  Unknown $\Omega$

If $\Omega$ is unknown, the practice is to get a consistent estimate of $\Omega$, say $\hat{\Omega}$, and substitute that in
$\hat{\beta}_{GLS}$. The resulting estimator is

$\hat{\beta}_{FGLS} = (X'\hat{\Omega}^{-1}X)^{-1}X'\hat{\Omega}^{-1}y$  (9.19)

and is called a feasible GLS estimator of $\beta$. Once $\hat{\Omega}$ replaces $\Omega$ the Gauss-Markov Theorem
no longer necessarily holds. In other words, $\hat{\beta}_{FGLS}$ is not BLUE, although it is still consistent.
The finite sample properties of $\hat{\beta}_{FGLS}$ are in general difficult to derive. However, we have the
following asymptotic results.

Theorem 1: $\sqrt{n}(\hat{\beta}_{GLS} - \beta)$ and $\sqrt{n}(\hat{\beta}_{FGLS} - \beta)$ have the same asymptotic distribution
$N(0, \sigma^2Q^{-1})$, where $Q = \lim (X'\Omega^{-1}X)/n$ as $n \to \infty$, if (i) plim $X'(\hat{\Omega}^{-1} - \Omega^{-1})X/n = 0$
and (ii) plim $X'(\hat{\Omega}^{-1} - \Omega^{-1})u/n = 0$. A sufficient condition for this theorem to hold is that $\hat{\Omega}$
is a consistent estimator of $\Omega$ and $X$ has a satisfactory limiting behavior.

Lemma 1: If in addition plim $u'(\hat{\Omega}^{-1} - \Omega^{-1})u/n = 0$, then $s^{*2} = e'_{GLS}\Omega^{-1}e_{GLS}/(n - K)$ and
$\hat{s}^{*2} = e'_{FGLS}\hat{\Omega}^{-1}e_{FGLS}/(n - K)$ are both consistent for $\sigma^2$. This means that one can perform
test of hypotheses based on asymptotic arguments using $\hat{\beta}_{FGLS}$ and $\hat{s}^{*2}$ rather than $\hat{\beta}_{GLS}$ and
$s^{*2}$, respectively. For a proof of Theorem 1 and Lemma 1, see Theil (1971), Schmidt (1976) or
Judge et al. (1985).

Monte Carlo evidence under heteroskedasticity or serial correlation suggests that there is gain
in performing feasible GLS rather than OLS in finite samples. However, we have also seen in
Chapter 5 that performing a two-step Cochrane-Orcutt procedure is not necessarily better than
OLS if the $X$'s are trended. This says that feasible GLS omitting the first observation (in this
case Cochrane-Orcutt) may not be better in finite samples than OLS using all the observations.
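
One way to make the feasible GLS idea concrete is a two-step procedure for AR(1) disturbances that keeps the first observation (a Prais-Winsten-type transformation with an estimated $\rho$). The sketch below is an assumption-laden illustration: the particular estimator of $\rho$ from the OLS residuals, the simulated data, and the function name `prais_winsten_fgls` are choices made here, not prescriptions from the text.

```python
import numpy as np

def prais_winsten_fgls(X, y):
    """Two-step feasible GLS for AR(1) disturbances, keeping the first observation."""
    T = len(y)
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b_ols
    rho_hat = (e[1:] @ e[:-1]) / (e @ e)           # one simple estimate of rho from OLS residuals
    P_inv = np.eye(T) - rho_hat * np.eye(T, k=-1)  # quasi-differencing rows
    P_inv[0, 0] = np.sqrt(1.0 - rho_hat ** 2)      # first observation is rescaled, not dropped
    b_fgls, *_ = np.linalg.lstsq(P_inv @ X, P_inv @ y, rcond=None)
    return b_fgls, rho_hat

# Simulated AR(1) disturbances (illustrative values only)
rng = np.random.default_rng(4)
T, rho = 200, 0.6
eps = rng.normal(size=T)
u = np.empty(T)
u[0] = eps[0] / np.sqrt(1.0 - rho ** 2)
for t in range(1, T):
    u[t] = rho * u[t - 1] + eps[t]
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + u

b_fgls, rho_hat = prais_winsten_fgls(X, y)
print(b_fgls, rho_hat)
```

The result coincides with formula (9.19) when $\hat{\Omega}$ is the AR(1) correlation matrix evaluated at the estimated $\rho$.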


9.8  The  W,  LR  and  LM  Statistics  Revisited 

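In this section we present a simplified and more general proof of $W \geq LR \geq LM$ due to
Breusch (1979). Consider the general linear model given in (9.1) with $u \sim N(0, \Sigma)$ and $H_0$; $R\beta = r$.
The likelihood function given in (9.14) with $\Sigma = \sigma^2\Omega$ can be maximized with respect to
$\beta$ and $\Sigma$ without imposing $H_0$, yielding the unrestricted estimators $\hat{\beta}_u$ and $\hat{\Sigma}$, where $\hat{\beta}_u =
(X'\hat{\Sigma}^{-1}X)^{-1}X'\hat{\Sigma}^{-1}y$. Similarly, this likelihood can be maximized subject to the restriction $H_0$,
yielding $\tilde{\beta}_r$ and $\tilde{\Sigma}$, where

$\tilde{\beta}_r = (X'\tilde{\Sigma}^{-1}X)^{-1}X'\tilde{\Sigma}^{-1}y - (X'\tilde{\Sigma}^{-1}X)^{-1}R'\tilde{\mu}$  (9.20)

as in (9.17), where $\tilde{\mu} = \tilde{A}^{-1}(R\hat{\beta}_r - r)$ is the Lagrange multiplier described in equation (7.35) of
Chapter 7 and $\tilde{A} = [R(X'\tilde{\Sigma}^{-1}X)^{-1}R']$. The major distinction from Chapter 7 is that $\Sigma$ is
unknown and has to be estimated. Let $\hat{\beta}_r$ denote the unrestricted maximum likelihood estimator of
$\beta$ conditional on the restricted variance-covariance estimator $\tilde{\Sigma}$ and let $\tilde{\beta}_u$ denote the restricted
maximum likelihood of $\beta$ (satisfying $H_0$) conditional on the unrestricted variance-covariance
estimator $\hat{\Sigma}$. More explicitly,

$\hat{\beta}_r = (X'\tilde{\Sigma}^{-1}X)^{-1}X'\tilde{\Sigma}^{-1}y$  (9.21)

and

$\tilde{\beta}_u = \hat{\beta}_u - (X'\hat{\Sigma}^{-1}X)^{-1}R'\hat{A}^{-1}(R\hat{\beta}_u - r)$  (9.22)

Knowing $\Sigma$, the Likelihood Ratio statistic is given by

$LR = -2\log[\max_{R\beta = r} L(\beta/\Sigma)/\max_{\beta} L(\beta/\Sigma)] = -2\log[L(\tilde{\beta}, \Sigma)/L(\hat{\beta}, \Sigma)] = \tilde{u}'\Sigma^{-1}\tilde{u} - \hat{u}'\Sigma^{-1}\hat{u}$  (9.23)

where $\tilde{u} = y - X\tilde{\beta}$ and $\hat{u} = y - X\hat{\beta}$; both estimators of $\beta$ are conditional on a known $\Sigma$.

$R\hat{\beta}_u \sim N(R\beta, R(X'\hat{\Sigma}^{-1}X)^{-1}R')$

and the Wald statistic is given by

$W = (R\hat{\beta}_u - r)'\hat{A}^{-1}(R\hat{\beta}_u - r)$  where  $\hat{A} = [R(X'\hat{\Sigma}^{-1}X)^{-1}R']$  (9.24)

Using (9.22), it is easy to show that $\tilde{u}_u = y - X\tilde{\beta}_u$ and $\hat{u}_u = y - X\hat{\beta}_u$ are related as follows:

$\tilde{u}_u = \hat{u}_u + X(X'\hat{\Sigma}^{-1}X)^{-1}R'\hat{A}^{-1}(R\hat{\beta}_u - r)$  (9.25)

and

$\tilde{u}'_u\hat{\Sigma}^{-1}\tilde{u}_u = \hat{u}'_u\hat{\Sigma}^{-1}\hat{u}_u + (R\hat{\beta}_u - r)'\hat{A}^{-1}(R\hat{\beta}_u - r)$  (9.26)

The cross-product terms are zero because $X'\hat{\Sigma}^{-1}\hat{u}_u = 0$. Therefore,

$W = \tilde{u}'_u\hat{\Sigma}^{-1}\tilde{u}_u - \hat{u}'_u\hat{\Sigma}^{-1}\hat{u}_u = -2\log[L(\tilde{\beta}_u, \hat{\Sigma})/L(\hat{\beta}_u, \hat{\Sigma})] = -2\log[\max_{R\beta = r} L(\beta/\hat{\Sigma})/\max_{\beta} L(\beta/\hat{\Sigma})]$  (9.27)

and the Wald statistic can be interpreted as a LR statistic conditional on $\hat{\Sigma}$, the unrestricted
maximum likelihood estimator of $\Sigma$.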

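Similarly, the Lagrange multiplier statistic, which tests that $\mu = 0$, is given by

$LM = \tilde{\mu}'\tilde{A}\tilde{\mu} = (R\hat{\beta}_r - r)'\tilde{A}^{-1}(R\hat{\beta}_r - r)$  (9.28)

Using (9.20) one can easily show that

$\tilde{u}_r = \hat{u}_r + X(X'\tilde{\Sigma}^{-1}X)^{-1}R'\tilde{A}^{-1}(R\hat{\beta}_r - r)$  (9.29)

and

$\tilde{u}'_r\tilde{\Sigma}^{-1}\tilde{u}_r = \hat{u}'_r\tilde{\Sigma}^{-1}\hat{u}_r + \tilde{\mu}'\tilde{A}\tilde{\mu}$  (9.30)

where $\tilde{u}_r = y - X\tilde{\beta}_r$ and $\hat{u}_r = y - X\hat{\beta}_r$. The cross-product terms are zero because $X'\tilde{\Sigma}^{-1}\hat{u}_r = 0$. Therefore,

$LM = \tilde{u}'_r\tilde{\Sigma}^{-1}\tilde{u}_r - \hat{u}'_r\tilde{\Sigma}^{-1}\hat{u}_r = -2\log[L(\tilde{\beta}_r, \tilde{\Sigma})/L(\hat{\beta}_r, \tilde{\Sigma})] = -2\log[\max_{R\beta = r} L(\beta/\tilde{\Sigma})/\max_{\beta} L(\beta/\tilde{\Sigma})]$  (9.31)

and the Lagrange multiplier statistic can be interpreted as a LR statistic conditional on $\tilde{\Sigma}$, the
restricted maximum likelihood of $\Sigma$. Given that

$\max_{\beta} L(\beta/\tilde{\Sigma}) \leq \max_{\beta, \Sigma} L(\beta, \Sigma) = \max_{\beta} L(\beta/\hat{\Sigma})$  (9.32)

$\max_{R\beta = r} L(\beta/\hat{\Sigma}) \leq \max_{R\beta = r, \Sigma} L(\beta, \Sigma) = \max_{R\beta = r} L(\beta/\tilde{\Sigma})$  (9.33)

it can be easily shown that the likelihood ratio statistic given by

$LR = -2\log[\max_{R\beta = r, \Sigma} L(\beta, \Sigma)/\max_{\beta, \Sigma} L(\beta, \Sigma)]$  (9.34)

satisfies the following inequality

$W \geq LR \geq LM$  (9.35)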

The  proof  is  left  to  the  reader,  see  problem  6. 
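
As a numerical illustration of the ordering, consider the special case $\Sigma = \sigma^2 I_n$ in a simple regression with $H_0$ that the slope is zero. Under that assumption the three statistics reduce to functions of the restricted and unrestricted residual sums of squares, with $\hat{\sigma}^2 = RSS_u/n$ and $\tilde{\sigma}^2 = RSS_r/n$. The data below are simulated and the variable names are hypothetical; this is a sketch of the special case, not of the general proof.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.normal(size=n)
y = 1.0 + 0.3 * x + rng.normal(size=n)

# Unrestricted model: constant and slope
X = np.column_stack([np.ones(n), x])
b_u, *_ = np.linalg.lstsq(X, y, rcond=None)
rss_u = np.sum((y - X @ b_u) ** 2)

# Restricted model under H0 (slope = 0): constant only
rss_r = np.sum((y - y.mean()) ** 2)

W = n * (rss_r - rss_u) / rss_u      # LR conditional on the unrestricted variance estimate
LR = n * np.log(rss_r / rss_u)       # full likelihood ratio
LM = n * (rss_r - rss_u) / rss_r     # LR conditional on the restricted variance estimate
print(W, LR, LM)
```

Writing $z = RSS_r/RSS_u \geq 1$, the three statistics are $n(z - 1)$, $n\log z$ and $n(1 - 1/z)$, which makes the ordering in (9.35) visible directly.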

This general and simple proof holds as long as the maximum likelihood estimator of $\beta$ is
uncorrelated with the maximum likelihood estimator of $\Sigma$, see Breusch (1979).


9.9  Spatial Error Correlation¹

Unlike time-series, there is typically no unique natural ordering for cross-sectional data. Spatial
autocorrelation permits correlation of the disturbance terms across cross-sectional units. There
is an extensive literature on spatial models in regional science, urban economics, geography
and statistics, see Anselin (1988). Examples in economics usually involve spillover effects or
externalities due to geographical proximity. One example is the effect of the productivity of public
capital, like roads and highways, on the output of neighboring states. Another is the pricing of
welfare in one state that pushes recipients to other states. Spatial correlation could relate directly
to the model dependent variable $y$, the exogenous variables $X$, the disturbance term $u$, or to a
combination of all three. Here we consider spatial correlation in the disturbances and leave the
remaining literature on spatial dependence to the motivated reader to pursue in Anselin (1988, 2001)
and Anselin and Bera (1998) to mention a few.

For the cross-sectional disturbances, the spatial autocorrelation is specified as

$u = \lambda Wu + \epsilon$  (9.36)

where $\lambda$ is the spatial autoregressive coefficient satisfying $|\lambda| < 1$, and $\epsilon \sim$ IIN$(0, \sigma^2)$. $W$ is a
known spatial weight matrix with diagonal elements equal to zero. $W$ also satisfies some other
regularity conditions like the fact that $I_n - \lambda W$ must be nonsingular.

The regression model given in (9.1) can be written as

$y = X\beta + (I_n - \lambda W)^{-1}\epsilon$  (9.37)

with the variance-covariance matrix of the disturbances given by

$\Sigma = \sigma^2\Omega = \sigma^2(I_n - \lambda W)^{-1}(I_n - \lambda W')^{-1}$  (9.38)

Under normality of the disturbances, Ord (1975) derived the maximum likelihood estimators. The log-likelihood is

$\ln L = -\frac{1}{2}\ln|\Omega| - \frac{n}{2}\ln 2\pi\sigma^2 - (y - X\beta)'\Omega^{-1}(y - X\beta)/2\sigma^2$  (9.39)

The Jacobian term simplifies by using

$\ln|\Omega| = -2\ln|I - \lambda W| = -2\sum_{i=1}^{n}\ln(1 - \lambda w_i)$  (9.40)

where $w_i$ are the eigenvalues of the spatial weight matrix $W$. The first-order conditions yield
the familiar GLS estimator of $\beta$ and the associated estimator of $\sigma^2$:

$\hat{\beta}_{MLE} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$  and  $\hat{\sigma}^2_{MLE} = e'_{MLE}\Omega^{-1}e_{MLE}/n$  (9.41)


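where $e_{MLE} = y - X\hat{\beta}_{MLE}$. An estimate of $\lambda$ can be obtained using the iterative solution of
the first-order conditions in Magnus (1978, p. 283). Differentiating (9.39) with respect to $\lambda$ and
setting the result to zero gives

$\frac{1}{2}\,\mathrm{tr}\left[\left(\frac{\partial\Omega^{-1}}{\partial\lambda}\right)\Omega\right] = e'_{MLE}\left(\frac{\partial\Omega^{-1}}{\partial\lambda}\right)e_{MLE}\Big/2\hat{\sigma}^2_{MLE}$  (9.42)

where, since $\Omega^{-1} = (I_n - \lambda W')(I_n - \lambda W)$ from (9.38),

$\partial\Omega^{-1}/\partial\lambda = -W - W' + 2\lambda W'W$  (9.43)

Alternatively, one can substitute $\hat{\beta}_{MLE}$ and $\hat{\sigma}^2_{MLE}$ from (9.41) into the log-likelihood in (9.39)
to get the concentrated log-likelihood which will be a nonlinear function of $\lambda$, see Anselin (1988)
for details.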
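
The concentrated log-likelihood can be sketched in a few lines. The example below is illustrative only: it assumes a hypothetical row-standardized weight matrix for units arranged on a line (`line_weights`), simulated data, and a crude grid search over $\lambda$ in place of a proper optimizer. It uses $\Omega^{-1} = B'B$ with $B = I_n - \lambda W$, so that GLS is OLS on the spatially filtered data $By$ and $BX$, and $-\frac{1}{2}\ln|\Omega| = \ln|B|$.

```python
import numpy as np

def line_weights(n):
    """Hypothetical row-standardized contiguity matrix: neighbors are i-1 and i+1."""
    W = np.zeros((n, n))
    for i in range(n - 1):
        W[i, i + 1] = W[i + 1, i] = 1.0
    return W / W.sum(axis=1, keepdims=True)

def concentrated_loglik(lam, X, y, W):
    """Log-likelihood (9.39) with beta and sigma^2 replaced by their values in (9.41)."""
    n = len(y)
    B = np.eye(n) - lam * W                        # Omega^{-1} = B'B
    Xs, ys = B @ X, B @ y                          # spatially filtered data
    b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)    # GLS estimate of beta
    e = ys - Xs @ b
    sigma2 = e @ e / n                             # e_MLE' Omega^{-1} e_MLE / n
    _, logdet_B = np.linalg.slogdet(B)             # -0.5 * ln|Omega| = ln|det B|
    value = logdet_B - 0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * n
    return value, b, sigma2

rng = np.random.default_rng(7)
n, lam_true = 60, 0.5
W = line_weights(n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = np.linalg.solve(np.eye(n) - lam_true * W, rng.normal(size=n))   # u = (I - lambda W)^{-1} eps
y = X @ np.array([1.0, 2.0]) + u

grid = np.linspace(-0.9, 0.9, 37)
values = [concentrated_loglik(l, X, y, W)[0] for l in grid]
lam_hat = grid[int(np.argmax(values))]
print(lam_hat)
```

A finer search or a one-dimensional optimizer would normally replace the grid; the grid is used here only to keep the sketch dependency-free.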

Testing for zero spatial autocorrelation, i.e., $H_0$; $\lambda = 0$, is usually based on the Moran I-test
which is similar to the Durbin-Watson statistic in time-series. This is given by

$MI = \frac{n}{S_0}\,\frac{e'We}{e'e}$  (9.44)

where $e$ denotes the vector of OLS residuals and $S_0$ is a standardization factor equal to the sum
of the spatial weights $\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}$. For a row-standardized weights matrix $W$ where each row
sums to one, $S_0 = n$ and the Moran I-statistic simplifies to $e'We/e'e$. In practice the test is
implemented by standardizing it and using the asymptotic $N(0, 1)$ critical values, see Anselin
and Bera (1998). In fact, for a row-standardized $W$ matrix, the mean and variance of the Moran
I-statistic are obtained from

$E(MI) = E\left(\frac{e'We}{e'e}\right) = \mathrm{tr}(\bar{P}_XW)/(n - k)$  (9.45)

and

$E(MI^2) = \frac{\mathrm{tr}(\bar{P}_XW\bar{P}_XW') + \mathrm{tr}(\bar{P}_XW)^2 + \{\mathrm{tr}(\bar{P}_XW)\}^2}{(n - k)(n - k + 2)}$

Alternatively, one can derive the Lagrange Multiplier test for $H_0$; $\lambda = 0$ using the result that
$\partial\ln L/\partial\lambda$ evaluated under the null of $\lambda = 0$ is equal to $u'Wu/\sigma^2$ and the fact that the Information
matrix is block-diagonal between $\beta$ and $(\sigma^2, \lambda)$, see problem 14. In fact, one can show that

$LM_\lambda = \frac{(e'We/\tilde{\sigma}^2)^2}{\mathrm{tr}[(W' + W)W]}$

with $\tilde{\sigma}^2 = e'e/n$. Under $H_0$, $LM_\lambda$ is asymptotically distributed as $\chi^2_1$. One can clearly see the
connection between Moran's I-statistic and $LM_\lambda$. Computationally, the W and LR tests are
more demanding since they require ML estimation under spatial autocorrelation.
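
Both statistics need only OLS residuals and the weight matrix, as the following sketch shows. The line-contiguity weight matrix, the simulated data, and the function names are assumptions for illustration; any known $W$ with zero diagonal could be substituted.

```python
import numpy as np

def line_weights(n):
    """Hypothetical row-standardized contiguity matrix: neighbors are i-1 and i+1."""
    W = np.zeros((n, n))
    for i in range(n - 1):
        W[i, i + 1] = W[i + 1, i] = 1.0
    return W / W.sum(axis=1, keepdims=True)

def moran_and_lm(X, y, W):
    """Moran's I (9.44) and the LM statistic for lambda = 0, from OLS residuals."""
    n = len(y)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    S0 = W.sum()                                   # sum of all spatial weights
    moran = (n / S0) * (e @ W @ e) / (e @ e)
    sigma2_tilde = e @ e / n
    lm = (e @ W @ e / sigma2_tilde) ** 2 / np.trace((W.T + W) @ W)
    return moran, lm, e

rng = np.random.default_rng(8)
n = 50
W = line_weights(n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)  # no spatial correlation in this draw

moran, lm, e = moran_and_lm(X, y, W)
print(moran, lm)
```

For a row-standardized $W$ the numerator of the LM statistic is $(n \cdot MI)^2$, which is the connection between the two statistics noted above.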

This is only a brief introduction to the spatial dependence literature. Hopefully, it will motivate
the reader to explore alternative formulations of spatial dependence, alternative estimation
and testing methods discussed in this literature and the numerous applications in economics on
hedonic housing, crime rates, police expenditures and R&D spillovers, to mention a few.


Note 


1. This section is based on Anselin (1988, 2001) and Anselin and Bera (1998).


Problems

1.  GLS  Is  More  Efficient  than  OLS. 

(a) Using equation (7.5) of Chapter 7, verify that $\mathrm{var}(\hat{\beta}_{OLS})$ is that given in (9.5).

(b) Show that $\mathrm{var}(\hat{\beta}_{OLS}) - \mathrm{var}(\hat{\beta}_{GLS}) = \sigma^2A\Omega A'$ where

$A = [(X'X)^{-1}X' - (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}]$.

Conclude that this difference in variances is positive semi-definite.

2. $s^2$ Is No Longer Unbiased for $\sigma^2$.

(a) Show that $E(s^2) = \sigma^2\mathrm{tr}(\Omega\bar{P}_X)/(n - K) \neq \sigma^2$. Hint: Follow the same proof given below
equation (7.6) of Chapter 7, but substitute $\sigma^2\Omega$ instead of $\sigma^2I_n$.

(b) Use the fact that $P_X$ and $\Sigma$ are non-negative definite matrices with $\mathrm{tr}(\Sigma P_X) \geq 0$ to show
that $0 \leq E(s^2) \leq \mathrm{tr}(\Sigma)/(n - K)$ where $\mathrm{tr}(\Sigma) = \sum_{i=1}^{n}\sigma_i^2$ with $\sigma_i^2 = \mathrm{var}(u_i) \geq 0$. This
bound was derived by Dufour (1986). Under homoskedasticity, show that this bound becomes
$0 \leq E(s^2) \leq n\sigma^2/(n - K)$. In general, $0 \leq$ {mean of $n - K$ smallest characteristic roots of
$\Sigma$} $\leq E(s^2) \leq$ {mean of $n - K$ largest characteristic roots of $\Sigma$} $\leq \mathrm{tr}(\Sigma)/(n - K)$, see Sathe
and Vinod (1974) and Neudecker (1977, 1978).

(c) Show that a sufficient condition for $s^2$ to be consistent for $\sigma^2$ irrespective of $X$ is that
$\lambda_{max}$ = the largest characteristic root of $\Omega$ is $o(n)$, i.e., $\lambda_{max}/n \to 0$ as $n \to \infty$ and plim
$(u'u/n) = \sigma^2$. Hint: $s^2 = u'\bar{P}_Xu/(n - K) = u'u/(n - K) - u'P_Xu/(n - K)$. By assumption,
the first term tends in probability limits to $\sigma^2$ as $n \to \infty$. The second term has expectation
$\sigma^2\mathrm{tr}(P_X\Omega)/(n - K)$. Now $P_X\Omega$ has rank $K$ and therefore exactly $K$ non-zero characteristic
roots each of which cannot exceed $\lambda_{max}$. This means that $E[u'P_Xu/(n - K)] \leq \sigma^2K\lambda_{max}/(n - K)$.
Using the condition that $\lambda_{max}/n \to 0$ proves the result. See Kramer and Berghoff (1991).

(d) Using the same reasoning in part (a), show that $s^{*2}$ given in (9.6) is unbiased for $\sigma^2$.

3.  The  AR(1)  Model.  See  Kadiyala  (1968). 

(a) Verify that $\Omega\Omega^{-1} = I_T$ for $\Omega$ and $\Omega^{-1}$ given in (9.9) and (9.10), respectively.

(b) Show that $P^{-1\prime}P^{-1} = (1 - \rho^2)\Omega^{-1}$ for $P^{-1}$ defined in (9.11).

(c) Conclude that $\mathrm{var}(P^{-1}u) = \sigma_\epsilon^2I_T$. Hint: $\Omega = (1 - \rho^2)PP'$ as can be easily derived from part
(b).

4. Restricted GLS. Using the derivation of the restricted least squares estimator for $u \sim (0, \sigma^2I_n)$ in
Chapter 7, verify equation (9.17) for the restricted GLS estimator based on $u \sim (0, \sigma^2\Omega)$. Hint:
Apply restricted least squares results to the transformed model given in (9.3).

5. Best Linear Unbiased Prediction. This is based on Goldberger (1962). Consider all linear predictors
of $y_{T+s} = x'_{T+s}\beta + u_{T+s}$ of the form $\hat{y}_{T+s} = c'y$, where $u \sim (0, \Sigma)$ and $\Sigma = \sigma^2\Omega$.

(a) Show that $c'X = x'_{T+s}$ for $\hat{y}_{T+s}$ to be unbiased.

(b) Show that $\mathrm{var}(\hat{y}_{T+s}) = c'\Sigma c + \sigma^2_{T+s} - 2c'\omega$ where $\mathrm{var}(u_{T+s}) = \sigma^2_{T+s}$ and $\omega = E(u_{T+s}u)$.

(c) Minimize $\mathrm{var}(\hat{y}_{T+s})$ given in part (b) subject to $c'X = x'_{T+s}$ and show that

$\hat{c} = \Sigma^{-1}[I_T - X(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}]\omega + \Sigma^{-1}X(X'\Sigma^{-1}X)^{-1}x_{T+s}$

This means that $\hat{y}_{T+s} = \hat{c}'y = x'_{T+s}\hat{\beta}_{GLS} + \omega'\Sigma^{-1}e_{GLS} = x'_{T+s}\hat{\beta}_{GLS} + \omega'\Omega^{-1}e_{GLS}/\sigma^2$. For
$s = 1$, i.e., predicting one period ahead, this verifies equation (9.18). Hint: Use partitioned
inverse in solving the first-order minimization equations.

(d) Show that $\hat{y}_{T+s} = x'_{T+s}\hat{\beta}_{GLS} + \rho^se_{T,GLS}$ for the stationary AR(1) disturbances with
autoregressive parameter $\rho$, and $|\rho| < 1$.

6. The W, LR and LM Inequality. Using the inequalities given in equations (9.32) and (9.33) verify
equation (9.35) which states that $W \geq LR \geq LM$. Hint: Use the conditional likelihood ratio
interpretations of W and LM given in equations (9.27) and (9.31) respectively.

7.  Consider  the  simple  linear  regression 

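$y_i = \alpha + \beta X_i + u_i$    $i = 1, 2, \ldots, n$

with $u_i \sim$ IIN$(0, \sigma^2)$. For $H_0$; $\beta = 0$, derive the LR, W and LM statistics in terms of conditional
likelihood ratios as described in Breusch (1979). In other words, compute
$W = -2\log[\max_{H_0} L(\alpha, \beta/\hat{\sigma}^2)/\max_{\alpha, \beta} L(\alpha, \beta/\hat{\sigma}^2)]$,
$LM = -2\log[\max_{H_0} L(\alpha, \beta/\tilde{\sigma}^2)/\max_{\alpha, \beta} L(\alpha, \beta/\tilde{\sigma}^2)]$ and
$LR = -2\log[\max_{H_0} L(\alpha, \beta, \sigma^2)/\max_{\alpha, \beta, \sigma^2} L(\alpha, \beta, \sigma^2)]$
where $\hat{\sigma}^2$ is the unrestricted MLE of $\sigma^2$ while $\tilde{\sigma}^2$ is the restricted MLE of $\sigma^2$ under
$H_0$. Use these results to infer that $W \geq LR \geq LM$.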

8. Sampling Distributions and Efficiency Comparison of OLS and GLS. Consider the following
regression model $y_t = \beta x_t + u_t$ for $(t = 1, 2)$, where $\beta = 2$ and $x_t$ takes on the fixed values $x_1 = 1$,
$x_2 = 2$. The $u_t$'s have the following discrete joint probability distribution:

    (u1, u2)      Probability
    (-1, -2)      1/8
    (1, -2)       3/8
    (-1, 2)       3/8
    (1, 2)        1/8

(a) What is the variance-covariance matrix of the disturbances? Are the disturbances
heteroskedastic? Are they correlated?

(b) Find the sampling distributions of $\hat{\beta}_{OLS}$ and $\hat{\beta}_{GLS}$ and verify that $\mathrm{var}(\hat{\beta}_{OLS}) > \mathrm{var}(\hat{\beta}_{GLS})$.

(c) Find the sampling distribution of the OLS residuals and verify that the estimated $\mathrm{var}(\hat{\beta}_{OLS})$
is biased. Also, find the sampling distribution of the GLS residuals and verify that the MSE
of the GLS regression is an unbiased estimator of the GLS regression variance. Hint: Read
Oksanen (1991) and Phillips and Wickens (1978), pp. 3-4. This problem is based on Baltagi
(1992). See also the solution by Im and Snow (1993).

9. Equi-correlation. This problem is based on Baltagi (1998). Consider the regression model given
in (9.1) with equi-correlated disturbances, i.e., equal variances and equal covariances: $E(uu') =
\sigma^2\Omega = \sigma^2[(1 - \rho)I_T + \rho\iota_T\iota'_T]$ where $\iota_T$ is a vector of ones of dimension $T$ and $I_T$ is the identity
matrix. In this case, $\mathrm{var}(u_t) = \sigma^2$ and $\mathrm{cov}(u_t, u_s) = \rho\sigma^2$ for $t \neq s$ with $t = 1, 2, \ldots, T$. Assume
that the regression has a constant.

(a) Show that OLS on this model is equivalent to GLS. Hint: Verify Zyskind's condition given
in (9.8) using the fact that $P_X\iota_T = \iota_T$ if $\iota_T$ is a column of $X$.

(b) Show that $E(s^2) = \sigma^2(1 - \rho)$. Also, that $\Omega$ is positive semi-definite when $-1/(T - 1) \leq \rho \leq 1$.
Conclude that if $-1/(T - 1) \leq \rho \leq 1$, then $0 \leq E(s^2) \leq [T/(T - 1)]\sigma^2$. The lower and upper
bounds are attained at $\rho = 1$ and $\rho = -1/(T - 1)$, respectively, see Dufour (1986). Hint: $\Omega$
is positive semi-definite if for every arbitrary non-zero vector $a$ we have $a'\Omega a \geq 0$. What is
this expression for $a = \iota_T$?

(c) Show that for this equi-correlated regression model, the BLUP of $y_{T+1} = x'_{T+1}\beta + u_{T+1}$ is
$\hat{y}_{T+1} = x'_{T+1}\hat{\beta}_{OLS}$ as long as there is a constant in the model.

10. Consider the simple regression with no regressors and equi-correlated disturbances:

$y_i = \alpha + u_i$    $i = 1, \ldots, n$

where $E(u_i) = 0$ and

$\mathrm{cov}(u_i, u_j) = \rho\sigma^2$ for $i \neq j$
$\mathrm{cov}(u_i, u_j) = \sigma^2$ for $i = j$

with $-1/(n - 1) \leq \rho \leq 1$ for the variance-covariance matrix of the disturbances to be positive definite.

(a) Show that the OLS and GLS estimates of $\alpha$ are identical. This is based on Kruskal (1968).

(b) Show that the bias in $s^2$, the OLS estimator of $\sigma^2$, is given by $-\rho\sigma^2$.

(c) Show that the GLS estimator of $\sigma^2$ is unbiased.

(d) Show that the $E$[estimated $\mathrm{var}(\hat{\alpha})$ $-$ true $\mathrm{var}(\hat{\alpha}_{OLS})$] is also $-\rho\sigma^2$.

11. Prediction Error Variances Under Heteroskedasticity. This is based on Termayne (1985). Consider
the $t$-th observation of the linear regression model given in (9.1).

$y_t = x'_t\beta + u_t$    $t = 1, 2, \ldots, T$

where $y_t$ is a scalar, $x'_t$ is $1 \times K$ and $\beta$ is a $K \times 1$ vector of unknown coefficients. $u_t$ is assumed to
have zero mean, heteroskedastic variances $E(u_t^2) = (z'_t\gamma)^2$ where $z'_t$ is a $1 \times r$ vector of observed
variables and $\gamma$ is an $r \times 1$ vector of parameters. Furthermore, these $u_t$'s are not serially correlated,
so that $E(u_tu_s) = 0$ for $t \neq s$.

(a) Find the $\mathrm{var}(\hat{\beta}_{OLS})$ and $\mathrm{var}(\tilde{\beta}_{GLS})$ for this model.

(b) Suppose we are forecasting $y$ for period $f$ in the future knowing $x_f$, i.e., $y_f = x'_f\beta + u_f$ with
$f > T$. Let $\hat{e}_f$ and $\tilde{e}_f$ be the forecast errors derived using OLS and GLS, respectively. Show
that the prediction error variances of the point predictions of $y_f$ are given by

$\mathrm{var}(\hat{e}_f) = x'_f\left(\sum_{t=1}^{T}x_tx'_t\right)^{-1}\left[\sum_{t=1}^{T}x_tx'_t(z'_t\gamma)^2\right]\left(\sum_{t=1}^{T}x_tx'_t\right)^{-1}x_f + (z'_f\gamma)^2$

$\mathrm{var}(\tilde{e}_f) = x'_f\left[\sum_{t=1}^{T}x_tx'_t/(z'_t\gamma)^2\right]^{-1}x_f + (z'_f\gamma)^2$

(c) Show that the variances of the two forecast errors of conditional mean $E(y_f/x_f)$ based
upon $\hat{\beta}_{OLS}$ and $\tilde{\beta}_{GLS}$ and denoted by $\hat{c}_f$ and $\tilde{c}_f$, respectively are the first two terms of the
corresponding expressions in part (b).

(d)  Now  assume  that  K  =  1  and  r  =  1  so  that  there  is  only  one  single  regressor  Xt  and  one 
zt  variable  determining  the  heteroskedasticity.  Assume  also  for  simplicity  that  the  empirical 
moments  of  xt  match  the  population  moments  of  a  Normal  random  variable  with  mean  zero 
and  variance  9.  Show  that  the  relative  efficiency  of  the  OLS  to  the  GLS  predictor  of  yf  is 
equal  to  (T  +  1)/(T+  3),  whereas  the  relative  efficiency  of  the  corresponding  ratio  involving 
the  two  predictions  of  the  conditional  mean  is  (1/3). 

12. Estimation of Time Series Regressions with Autoregressive Disturbances and Missing Observations.
This is based on Baltagi and Wu (1997). Consider the following time series regression model,

$y_t = x'_t\beta + u_t$    $t = 1, \ldots, T$,

where $\beta$ is a $K \times 1$ vector of regression coefficients including the intercept. The disturbances follow
a stationary AR(1) process, that is,

$u_t = \rho u_{t-1} + \epsilon_t$,

with $|\rho| < 1$, $\epsilon_t$ is IIN$(0, \sigma_\epsilon^2)$, and $u_0 \sim N(0, \sigma_\epsilon^2/(1 - \rho^2))$. This model is only observed at times $t_j$
for $j = 1, \ldots, n$ with $1 = t_1 < \ldots < t_n = T$ and $n > K$. The typical covariance element of $u_t$ for
the observed periods $t_j$ and $t_s$ is given by

$\mathrm{cov}(u_{t_j}, u_{t_s}) = \frac{\sigma_\epsilon^2}{1 - \rho^2}\,\rho^{|t_j - t_s|}$  for $s, j = 1, \ldots, n$.

Knowing $\rho$, derive a simple Prais-Winsten-type transformation that will obtain GLS as a simple
least squares regression.

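13. Multiplicative Heteroskedasticity. This is based on Harvey (1976). Consider the linear model given
in (9.1) and let $u \sim N(0, \Sigma)$ where $\Sigma = \text{diag}[\sigma_i^2]$. Assume that $\sigma_i^2 = \sigma^2 h_i(\theta)$ with $\theta' = (\theta_1, \ldots, \theta_s)$
and $h_i(\theta) = \exp(\theta_1 z_{1i} + \ldots + \theta_s z_{si}) = \exp(z_i'\theta)$ with $z_i' = (z_{1i}, \ldots, z_{si})$.

(a) Show that the log-likelihood function is given by

$$ \log L(\beta, \theta, \sigma^2) = -\frac{N}{2} \log 2\pi\sigma^2 - \frac{1}{2} \sum_{i=1}^N \log h_i(\theta) - \frac{1}{2\sigma^2} \sum_{i=1}^N \frac{(y_i - x_i'\beta)^2}{h_i(\theta)} $$

and the score with respect to $\theta$ is

$$ \frac{\partial \log L}{\partial \theta} = -\frac{1}{2} \sum_{i=1}^N \frac{1}{h_i(\theta)} \frac{\partial h_i}{\partial \theta} + \frac{1}{2\sigma^2} \sum_{i=1}^N \frac{(y_i - x_i'\beta)^2}{(h_i(\theta))^2} \cdot \frac{\partial h_i}{\partial \theta} $$

Conclude that for multiplicative heteroskedasticity, equating this score to zero yields

$$ \sum_{i=1}^N \frac{(y_i - x_i'\beta)^2}{\exp(z_i'\theta)} z_i = \sigma^2 \sum_{i=1}^N z_i $$

(b) Show that the Information matrix is given by

$$ I(\beta, \theta, \sigma^2) = \begin{bmatrix} X'\Sigma^{-1}X & 0 & 0 \\ 0 & \frac{1}{2} \sum_{i=1}^N \frac{1}{(h_i(\theta))^2} \frac{\partial h_i}{\partial \theta} \frac{\partial h_i}{\partial \theta'} & \frac{1}{2\sigma^2} \sum_{i=1}^N \frac{1}{h_i(\theta)} \frac{\partial h_i}{\partial \theta} \\ 0 & \frac{1}{2\sigma^2} \sum_{i=1}^N \frac{1}{h_i(\theta)} \frac{\partial h_i}{\partial \theta'} & \frac{N}{2\sigma^4} \end{bmatrix} $$

and for multiplicative heteroskedasticity this becomes

$$ I(\beta, \theta, \sigma^2) = \begin{bmatrix} X'\Sigma^{-1}X & 0 & 0 \\ 0 & \frac{1}{2} Z'Z & \frac{1}{2\sigma^2} \sum_{i=1}^N z_i \\ 0 & \frac{1}{2\sigma^2} \sum_{i=1}^N z_i' & \frac{N}{2\sigma^4} \end{bmatrix} $$

where $Z' = (z_1, \ldots, z_N)$.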

(c) Assume that $h_i(\theta)$ satisfies $h_i(0) = 1$, then the test for heteroskedasticity is $H_0$: $\theta = 0$ versus
$H_1$: $\theta \neq 0$. Show that the score with respect to $\theta$ and $\sigma^2$ evaluated under the null hypothesis,
i.e., at $\theta = 0$ and $\tilde\sigma^2 = e'e/N$ is given by

$$ \tilde S = \begin{pmatrix} \frac{1}{2} \sum_{i=1}^N z_i \left( \frac{e_i^2}{\tilde\sigma^2} - 1 \right) \\ 0 \end{pmatrix} $$

where $e$ denotes the vector of OLS residuals. The Information matrix with respect to $\theta$ and
$\sigma^2$ can be obtained from the bottom right block of $I(\beta, \theta, \sigma^2)$ given in part (b). Conclude that
the score test for $H_0$ is given by

$$ LM = \frac{\sum_{i=1}^N z_i'(e_i^2 - \tilde\sigma^2) \left[ \sum_{i=1}^N (z_i - \bar z)(z_i - \bar z)' \right]^{-1} \sum_{i=1}^N z_i (e_i^2 - \tilde\sigma^2)}{2\tilde\sigma^4} $$

This statistic is asymptotically distributed as $\chi_s^2$ under $H_0$. From Chapter 5, we can see that
this is a special case of the Breusch and Pagan (1979) test-statistic which can be obtained as
one-half the regression sum of squares of $e^2/\tilde\sigma^2$ on a constant and $Z$. Koenker and Bassett
(1982) suggested replacing the denominator $2\tilde\sigma^4$ by $\sum_{i=1}^N (e_i^2 - \tilde\sigma^2)^2/N$ to make this test more
robust to departures from normality.

14.  Spatial  Autocorrelation.  Consider  the  regression  model  given  in  (9.1)  with  spatial  autocorrelation 
defined  in  (9.36). 

(a)  Verify  that  the  first-order  conditions  of  maximization  of  the  log-likelihood  function  given  in 
(9.39)  yield  (9.41). 

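(b) Show that for testing $H_0$: $\lambda = 0$, the score $\partial \ln L/\partial \lambda$ evaluated under the null, i.e., at $\lambda = 0$,
is given by $u'Wu/\sigma^2$.

(c) Show that the Information matrix with respect to $\sigma^2$ and $\lambda$, evaluated under the null of $\lambda = 0$,
is given by

$$ \begin{bmatrix} \dfrac{n}{2\sigma^4} & \dfrac{\text{tr}(W)}{\sigma^2} \\ \dfrac{\text{tr}(W)}{\sigma^2} & \text{tr}(W^2) + \text{tr}(W'W) \end{bmatrix} $$

(d) Conclude from parts (b) and (c) that the Lagrange Multiplier for $H_0$: $\lambda = 0$ is given by $LM_\lambda$
in (9.46). Hint: Use the fact that the diagonal elements of $W$ are zero, hence $\text{tr}(W) = 0$.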

15. Neighborhood Effects and Housing Demand. Ioannides and Zabel (2003) use data from the American
Housing Survey to estimate a model of housing demand with neighborhood effects. The
number of observations on housing units used was 1947 in 1985, 2318 in 1989 and 2909 in 1993.
The housing survey has detailed information for each of these housing units and their owners,
including: the owner's schooling, whether the owner is white, whether the owner is married, the
number of persons in the household, household income, and whether the house has changed owners
("changed hands") in the last 5 years. In addition, it reports the current owner's evaluation of the housing
unit's market value, as well as various structural characteristics of the housing unit (such as number
of bedrooms, bathrooms, and whether the house has a garage). The variable definitions are
given in Table VI of Ioannides and Zabel (2003, p. 568) and the data is available from the Journal
of  Applied  Econometrics  archive: 

(a)  Replicate  Table  VII  of  Ioannides  and  Zabel  (2003,  p.  569)  which  displays  the  means  and 
standard  deviations  for  some  of  the  variables  by  year  and  for  the  pooled  sample.  Note  that 
the  price  and  income  variables  are  different  from  the  numbers  reported  in  the  paper. 

(b)  Replicate  Table  VIII  of  Ioannides  and  Zabel  (2003,  p.  577)  which  reports  regression  results  of 
housing demand. Note that these regressions are different from those reported in the paper.


CHAPTER 10

Seemingly  Unrelated  Regressions 


When  asked  “How  did  you  get  the  idea  for  SUR?”  Zellner  responded:  “On  a  rainy 
night  in  Seattle  in  about  1956  or  1957,  I  somehow  got  the  idea  of  algebraically  writing 
a  multivariate  regression  model  in  single  equation  form.  When  I  figured  out  how  to  do 
that,  everything  fell  into  place  because  then  many  univariate  results  could  be  carried 
over  to  apply  to  the  multivariate  system  and  the  analysis  of  the  multivariate  system 
is much simplified notationally, algebraically and, conceptually.” Read the interview
of Professor Arnold Zellner by Rossi (1989, p. 292).


10.1  Introduction 


Consider two regression equations corresponding to two different firms

$$ y_i = X_i \beta_i + u_i \qquad i = 1, 2 \qquad (10.1) $$

where $y_i$ and $u_i$ are $T \times 1$ and $X_i$ is $(T \times K_i)$ with $u_i \sim (0, \sigma_{ii} I_T)$. OLS is BLUE on each
equation separately. Zellner's (1962) idea is to combine these Seemingly Unrelated Regressions
in one stacked model, i.e.,

$$ \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \qquad (10.2) $$

which can be written as

$$ y = X\beta + u \qquad (10.3) $$

where $y' = (y_1', y_2')$ and $X$ and $u$ are obtained similarly from (10.2). $y$ and $u$ are $2T \times 1$, $X$ is
$2T \times (K_1 + K_2)$ and $\beta$ is $(K_1 + K_2) \times 1$. The stacked disturbances have a variance-covariance
matrix

$$ \Omega = \begin{bmatrix} \sigma_{11} I_T & \sigma_{12} I_T \\ \sigma_{21} I_T & \sigma_{22} I_T \end{bmatrix} = \Sigma \otimes I_T \qquad (10.4) $$

where $\Sigma = [\sigma_{ij}]$ for $i, j = 1, 2$; with $\rho = \sigma_{12}/\sqrt{\sigma_{11}\sigma_{22}}$ measuring the extent of correlation
between the two regression equations. The Kronecker product operator $\otimes$ is defined in the Appendix
to Chapter 7. Some important applications of SUR models in economics include the
estimation of a system of demand equations or a translog cost function along with its share
equations, see Berndt (1991). Briefly, a system of demand equations explains household consumption
of several commodities. The correlation among equations could be due to unobservable
household specific attributes that influence the consumption of these commodities. Similarly, in
estimating a cost equation along with the corresponding input share equations based on firm
level data, the correlation among equations could be due to unobservable firm-specific effects
that influence input choice and cost in production decisions.
Problem 1 asks the reader to verify that OLS on the system of two equations in (10.2) yields
the same estimates as OLS on each equation in (10.1) taken separately. If $\rho$ is large we expect
gain in efficiency in performing GLS rather than OLS on (10.3). In this case

$$ \hat\beta_{GLS} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1}y \qquad (10.5) $$

where $\Omega^{-1} = \Sigma^{-1} \otimes I_T$. GLS will be BLUE for the system of two equations estimated jointly.
Note that we only need to invert $\Sigma$ to obtain $\Omega^{-1}$. $\Sigma$ is of dimension $2 \times 2$ whereas $\Omega$ is of
dimension $2T \times 2T$. In fact, if we denote by $\Sigma^{-1} = [\sigma^{ij}]$, then

$$ \hat\beta_{GLS} = \begin{bmatrix} \sigma^{11} X_1'X_1 & \sigma^{12} X_1'X_2 \\ \sigma^{21} X_2'X_1 & \sigma^{22} X_2'X_2 \end{bmatrix}^{-1} \begin{bmatrix} \sigma^{11} X_1'y_1 + \sigma^{12} X_1'y_2 \\ \sigma^{21} X_2'y_1 + \sigma^{22} X_2'y_2 \end{bmatrix} \qquad (10.6) $$

Zellner (1962) gave two sufficient conditions where it does not pay to perform GLS, i.e., GLS
on this system of equations turns out to be OLS on each equation separately. These are the
following:

Case 1: Zero correlation among the disturbances of the i-th and j-th equations, i.e., $\sigma_{ij} = 0$ for
$i \neq j$. This means that $\Sigma$ is diagonal which in turn implies that $\Sigma^{-1}$ is diagonal with $\sigma^{ii} = 1/\sigma_{ii}$
for $i = 1, 2$, and $\sigma^{ij} = 0$ for $i \neq j$. Therefore, (10.6) reduces to

$$ \hat\beta_{GLS} = \begin{bmatrix} \sigma_{11}(X_1'X_1)^{-1} & 0 \\ 0 & \sigma_{22}(X_2'X_2)^{-1} \end{bmatrix} \begin{bmatrix} X_1'y_1/\sigma_{11} \\ X_2'y_2/\sigma_{22} \end{bmatrix} = \begin{bmatrix} \hat\beta_{1,OLS} \\ \hat\beta_{2,OLS} \end{bmatrix} \qquad (10.7) $$

Case 2: Same regressors across all equations. This means that all the $X_i$'s are the same, i.e.,
$X_1 = X_2 = X^*$. This rules out different number of regressors in each equation and all the $X_i$'s
must have the same dimension, i.e., $K_1 = K_2 = K$. Hence, $X = I_2 \otimes X^*$ and (10.6) reduces to

$$ \hat\beta_{GLS} = [(I_2 \otimes X^{*\prime})(\Sigma^{-1} \otimes I_T)(I_2 \otimes X^*)]^{-1} [(I_2 \otimes X^{*\prime})(\Sigma^{-1} \otimes I_T) y] \qquad (10.8) $$

$$ = [\Sigma \otimes (X^{*\prime}X^*)^{-1}][(\Sigma^{-1} \otimes X^{*\prime}) y] = [I_2 \otimes (X^{*\prime}X^*)^{-1} X^{*\prime}] y = \hat\beta_{OLS} $$

These results generalize to the case of $M$ regression equations, but for simplicity of exposition
we considered the case of two equations only.
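
Case 2 can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using simulated data; the sample size, coefficient values and the matrix `Sigma` are illustrative assumptions, not taken from the text. It builds the stacked system with $X = I_2 \otimes X^*$ and $\Omega^{-1} = \Sigma^{-1} \otimes I_T$, computes GLS as in (10.5), and compares it with OLS run equation by equation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 50, 3

# Common regressor matrix X* (intercept plus two simulated regressors).
X_star = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])

# Disturbances correlated across the two equations: Sigma is NOT diagonal.
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])
U = rng.multivariate_normal(np.zeros(2), Sigma, size=T)  # T x 2
y1 = X_star @ np.array([1.0, 2.0, -1.0]) + U[:, 0]
y2 = X_star @ np.array([0.5, -1.0, 3.0]) + U[:, 1]

# Stacked system (10.3): X = I_2 kron X*, Omega^{-1} = Sigma^{-1} kron I_T.
X = np.kron(np.eye(2), X_star)
y = np.concatenate([y1, y2])
Omega_inv = np.kron(np.linalg.inv(Sigma), np.eye(T))

# GLS as in (10.5), solved without forming the inverse explicitly.
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# OLS on each equation separately.
b1 = np.linalg.lstsq(X_star, y1, rcond=None)[0]
b2 = np.linalg.lstsq(X_star, y2, rcond=None)[0]
beta_ols = np.concatenate([b1, b2])

# Largest absolute difference between the two estimators.
max_gap = float(np.max(np.abs(beta_gls - beta_ols)))
print(max_gap)
```

The gap is at the level of floating-point rounding even though the disturbances are strongly correlated, which is what (10.8) predicts. Replacing the second equation's regressors by a different matrix would generally make the two estimators differ.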

A necessary and sufficient condition for SUR(GLS) to be equivalent to OLS, was derived by
Dwivedi and Srivastava (1978). An alternative derivation based on the Milliken and Albohali
(1984) necessary and sufficient condition for OLS to be equivalent to GLS, is presented here,
see Baltagi (1988). In Chapter 9, we saw that GLS is equivalent to OLS, for every $y$, if and only
if

$$ X'\Omega^{-1}\bar P_X = 0 \qquad (10.9) $$

where $\bar P_X = I - P_X$. In this case, $X = \text{diag}[X_i]$, $\Omega^{-1} = \Sigma^{-1} \otimes I_T$, and $\bar P_X = \text{diag}[\bar P_{X_i}]$. Hence, the typical element
of (10.9), see problem 1, is

$$ \sigma^{ij} X_i' \bar P_{X_j} = 0 \qquad (10.10) $$

This is automatically satisfied for $i = j$. For $i \neq j$, this holds if $\sigma^{ij} = 0$ or $X_i'\bar P_{X_j} = 0$. Note
that $\sigma^{ij} = 0$ is the first sufficient condition provided by Zellner (1962). The latter condition
$X_i'\bar P_{X_j} = 0$ implies that the set of regressors in the i-th equation are a perfect linear combination
of those in the j-th equation. Since $X_j'\bar P_{X_i} = 0$ has to hold also, $X_j$ has to be a perfect linear
combination of the regressors in the i-th equation. $X_i$ and $X_j$ span the same space. Both $X_i$ and
$X_j$ have full column rank for OLS to be feasible, hence they have to be of the same dimension
for $X_i'\bar P_{X_j} = X_j'\bar P_{X_i} = 0$. In this case, $X_i = X_j C$, where $C$ is a nonsingular matrix, i.e., the
regressors in the i-th equation are a perfect linear combination of those in the j-th equation.
This includes the second sufficient condition derived by Zellner (1962). In practice, different
economic behavioral equations contain different numbers of right hand side variables. In this
case, one rearranges the SUR into blocks where each block has the same number of right hand
side variables. For two equations ($i$ and $j$) belonging to two different blocks ($i \neq j$), (10.10)
is satisfied if the corresponding $\sigma^{ij}$ is zero, i.e., $\Sigma$ has to be block diagonal. However, in this
case, GLS performed on the whole system is equivalent to GLS performed on each block taken
separately. Hence, (10.10) is satisfied for SUR if it is satisfied for each block taken separately.

Revankar (1974) considered the case where $X_2$ is a subset of $X_1$. In this case, there is no gain in
using SUR for estimating $\beta_2$. In fact, problem 2 asks the reader to verify that $\hat\beta_{2,SUR} = \hat\beta_{2,OLS}$.
However, this is not the case for $\beta_1$. It is easy to show that $\hat\beta_{1,SUR} = \hat\beta_{1,OLS} - A e_{2,OLS}$, where
$A$ is a matrix defined in problem 2, and $e_{2,OLS}$ are the OLS residuals for the second equation.

Telser  (1964)  suggested  an  iterative  least  squares  procedure  for  SUR  equations.  For  the  two 
equations  model  given  in  (10.1),  this  estimation  method  involves  the  following: 

1. Compute the OLS residuals $e_1$ and $e_2$ from both equations.

2. Include $e_1$ as an extra regressor in the second equation and $e_2$ as an extra regressor
in the first equation. Compute the new least squares residuals and iterate this step until
convergence of the estimated coefficients. The resulting estimator has the same asymptotic
distribution as Zellner's (1962) SUR estimator.

Conniffe (1982) suggests stopping at the second step because in small samples this provides most
of the improvement in precision. In fact, Conniffe (1982) argues that it may be unnecessary and
even disadvantageous to calculate Zellner's estimator proper. Extension to multiple equations
is simple. Step 1 is the same where one computes least squares residuals of every equation. Step
2 adds the residuals of all other equations as extra regressors in the equation of interest. OLS is run and the new
residuals are computed. One can stop at this second step or iterate until convergence.


10.2  Feasible  GLS  Estimation 

In practice, $\Sigma$ is not known and has to be estimated. Zellner (1962) recommended the following
feasible GLS estimation procedure:

$$ s_{ii} = \sum_{t=1}^T e_{it}^2/(T - K_i) \qquad \text{for } i = 1, 2 \qquad (10.11) $$

and

$$ s_{ij} = \sum_{t=1}^T e_{it} e_{jt}/[(T - K_i)^{1/2}(T - K_j)^{1/2}] \qquad \text{for } i, j = 1, 2 \text{ and } i \neq j \qquad (10.12) $$

where $e_{it}$ denotes OLS residuals of the i-th equation. $s_{ii}$ is the $s^2$ of the regression for the i-th
equation. This is unbiased for $\sigma_{ii}$. However, $s_{ij}$ for $i \neq j$ is not unbiased for $\sigma_{ij}$. In fact, the
unbiased estimate is

$$ \tilde s_{ij} = \sum_{t=1}^T e_{it} e_{jt}/[T - K_i - K_j + \text{tr}(B)] \qquad \text{for } i, j = 1, 2 \qquad (10.13) $$

where $B = X_i(X_i'X_i)^{-1}X_i'X_j(X_j'X_j)^{-1}X_j' = P_{X_i}P_{X_j}$, see problem 4. Using this last estimator
may lead to a variance-covariance matrix that is not positive definite. For consistency, however,
all we need is a division by $T$, although this leaves us with a biased estimator:

$$ \hat s_{ij} = \sum_{t=1}^T e_{it} e_{jt}/T \qquad \text{for } i, j = 1, 2 \qquad (10.14) $$

Using this consistent estimator of $\Sigma$ will result in feasible GLS estimates that are asymptotically
efficient. In fact, if one iterates this procedure, i.e., compute feasible GLS residuals and
second round estimates of $\Sigma$ using these GLS residuals in (10.14), and continue iterating, until
convergence, this will lead to maximum likelihood estimates of the regression coefficients, see
Oberhofer and Kmenta (1974).
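
The estimators in (10.11), (10.12) and (10.14) differ only in the divisor applied to the residual cross-products. The sketch below, in plain Python, computes both versions for two equations; the function name `sigma_hat` and the residual values are made up for illustration and do not come from any data set in the text.

```python
import math

def sigma_hat(e1, e2, k1, k2):
    """Estimate the 2 x 2 matrix Sigma from OLS residuals e1 and e2.

    Returns (zellner, consistent):
      zellner    uses the degrees-of-freedom divisors of (10.11)-(10.12),
      consistent divides every cross-product by T as in (10.14).
    """
    T = len(e1)
    ss11 = sum(a * a for a in e1)
    ss22 = sum(b * b for b in e2)
    ss12 = sum(a * b for a, b in zip(e1, e2))
    s12_df = ss12 / math.sqrt((T - k1) * (T - k2))
    zellner = [[ss11 / (T - k1), s12_df],
               [s12_df, ss22 / (T - k2)]]
    consistent = [[ss11 / T, ss12 / T],
                  [ss12 / T, ss22 / T]]
    return zellner, consistent

# Hypothetical residuals (each sums to zero, as OLS residuals do when the
# regression includes an intercept); k1 and k2 count the regressors.
e1 = [0.5, -1.0, 0.25, 0.75, -0.5]
e2 = [1.0, -0.5, -0.5, 0.5, -0.5]
zellner, consistent = sigma_hat(e1, e2, k1=2, k2=3)
print(zellner)
print(consistent)
```

With so few observations the two sets of estimates differ noticeably; the difference vanishes as $T$ grows relative to $K_i$ and $K_j$, which is why either can be used for asymptotically efficient feasible GLS.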


Relative  Efficiency  of  OLS  in  the  Case  of  Simple  Regressions 

To illustrate the gain in efficiency of Zellner's SUR compared to performing OLS on each
equation separately, Kmenta (1986, pp. 641-643) considers the following two simple regression
equations:

$$ Y_{1t} = \beta_{11} + \beta_{12} X_{1t} + u_{1t} \qquad (10.15) $$

$$ Y_{2t} = \beta_{21} + \beta_{22} X_{2t} + u_{2t} \qquad \text{for } t = 1, 2, \ldots, T; $$

and proves that

$$ \text{var}(\hat\beta_{12,GLS})/\text{var}(\hat\beta_{12,OLS}) = (1 - \rho^2)/[1 - \rho^2 r^2] \qquad (10.16) $$

where $\rho$ is the correlation coefficient between $u_1$ and $u_2$, and $r$ is the sample correlation coefficient
between $X_1$ and $X_2$. Problem 5 asks the reader to verify (10.16). In fact, the same relative
efficiency ratio holds for $\beta_{22}$, i.e., $\text{var}(\hat\beta_{22,GLS})/\text{var}(\hat\beta_{22,OLS})$ is given by that in (10.16). This
confirms the two results obtained above, namely, that as $\rho$ increases this relative efficiency ratio
decreases and OLS is less efficient than GLS. Also, as $r$ increases this relative efficiency ratio
increases and there is less gain in performing GLS rather than OLS. For $\rho = 0$ or $r = 1$, the
efficiency ratio is 1, and OLS is equivalent to GLS. However, if $\rho^2$ is large, say 0.9, and $r^2$ is small,
say 0.1, then (10.16) gives a relative efficiency of 0.11. For a tabulation of (10.16) for various
values of $\rho^2$ and $r^2$, see Table 12-1 of Kmenta (1986, p. 642).
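
The ratio in (10.16) is simple enough to tabulate directly. The sketch below is a small Python helper (the name `relative_efficiency` is illustrative) that takes $\rho^2$ and $r^2$ as inputs, matching the way the ratio is usually tabulated.

```python
def relative_efficiency(rho_sq, r_sq):
    """Ratio var(beta_GLS)/var(beta_OLS) from (10.16).

    rho_sq: squared correlation between the two disturbances (0 <= rho_sq < 1).
    r_sq:   squared sample correlation between X1 and X2 (0 <= r_sq <= 1).
    """
    if not (0.0 <= rho_sq < 1.0 and 0.0 <= r_sq <= 1.0):
        raise ValueError("need 0 <= rho_sq < 1 and 0 <= r_sq <= 1")
    return (1.0 - rho_sq) / (1.0 - rho_sq * r_sq)

# A small grid: rows are rho^2, columns are r^2.
for rho_sq in (0.0, 0.3, 0.6, 0.9):
    row = [round(relative_efficiency(rho_sq, r_sq), 3) for r_sq in (0.0, 0.1, 0.5, 1.0)]
    print(rho_sq, row)
```

For $\rho^2 = 0.9$ and $r^2 = 0.1$ the ratio is $0.1/0.91 \approx 0.11$, while $\rho^2 = 0$ or $r^2 = 1$ gives a ratio of exactly 1, so there is no gain from GLS in those cases.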


Relative  Efficiency  of  OLS  in  the  Case  of  Multiple  Regressions 


With more regressors in each equation, the relative efficiency story has to be modified, as
indicated by Binkley and Nelson (1988). In the two equation model considered in (10.2) with
$K_1$ regressors $X_1$ in the first equation and $K_2$ regressors $X_2$ in the second equation

$$ \text{var}(\hat\beta_{GLS}) = (X'\Omega^{-1}X)^{-1} = \begin{bmatrix} \sigma^{11} X_1'X_1 & \sigma^{12} X_1'X_2 \\ \sigma^{21} X_2'X_1 & \sigma^{22} X_2'X_2 \end{bmatrix}^{-1} \qquad (10.17) $$

If we focus on the regression estimates of the first equation, we get $\text{var}(\hat\beta_{1,GLS}) = A^{11} =
[\sigma^{11} X_1'X_1 - \sigma^{12} X_1'X_2 (\sigma^{22} X_2'X_2)^{-1} \sigma^{21} X_2'X_1]^{-1}$, where $A^{11}$ denotes the upper-left block of (10.17); see problem 6. Using the fact that

$$ \Sigma^{-1} = [1/(1 - \rho^2)] \begin{bmatrix} 1/\sigma_{11} & -\rho^2/\sigma_{12} \\ -\rho^2/\sigma_{21} & 1/\sigma_{22} \end{bmatrix} $$

where $\rho^2 = \sigma_{12}^2/\sigma_{11}\sigma_{22}$, one gets

$$ \text{var}(\hat\beta_{1,GLS}) = [\sigma_{11}(1 - \rho^2)] \{X_1'X_1 - \rho^2 (X_1' P_{X_2} X_1)\}^{-1} \qquad (10.18) $$

Adding and subtracting $\rho^2 X_1'X_1$ from the expression to be inverted, one gets

$$ \text{var}(\hat\beta_{1,GLS}) = \sigma_{11} \{X_1'X_1 + [\rho^2/(1 - \rho^2)] E'E\}^{-1} \qquad (10.19) $$

where $E = \bar P_{X_2} X_1$ is the matrix whose columns are the OLS residuals of each variable in
$X_1$ regressed on $X_2$. If $E = 0$, there is no gain in SUR over OLS for the estimation of $\beta_1$.
$X_1 = X_2$ or $X_1$ is a subset of $X_2$ are two such cases. One can easily verify that (10.19) is the
variance-covariance matrix of an OLS regression with regressor matrix

$$ W = \begin{bmatrix} X_1 \\ \theta E \end{bmatrix} $$

where $\theta^2 = \rho^2/(1 - \rho^2)$. Now let us focus on the efficiency of the estimated coefficient of the
q-th variable, $X_q$ in $X_1$. Recall, from Chapter 4, that for the regression of $y$ on $X_1$

$$ \text{var}(\hat\beta_{q,OLS}) = \sigma_{11} \Big/ \left\{ \sum_{t=1}^T x_{tq}^2 (1 - R_q^2) \right\} \qquad (10.20) $$

where the denominator is the residual sum of squares of $X_q$ on the other $(K_1 - 1)$ regressors in
$X_1$ and $R_q^2$ is the corresponding $R^2$ of that regression. Similarly, from (10.19),

$$ \text{var}(\hat\beta_{q,SUR}) = \sigma_{11} \Big/ \left\{ \left[ \sum_{t=1}^T x_{tq}^2 + \theta^2 \sum_{t=1}^T e_{tq}^2 \right] (1 - R_q^{*2}) \right\} \qquad (10.21) $$

where the denominator is the residual sum of squares of $\begin{bmatrix} X_q \\ \theta e_q \end{bmatrix}$ on the other $(K_1 - 1)$ regressors
in $W$, and $R_q^{*2}$ is the corresponding $R^2$ of that regression. Adding and subtracting $\sum_{t=1}^T x_{tq}^2 (1 - R_q^2)$
to the denominator of (10.21), we get

$$ \text{var}(\hat\beta_{q,SUR}) = \sigma_{11} \Big/ \left\{ \sum_{t=1}^T x_{tq}^2 (1 - R_q^2) + \sum_{t=1}^T x_{tq}^2 (R_q^2 - R_q^{*2}) + \theta^2 \sum_{t=1}^T e_{tq}^2 (1 - R_q^{*2}) \right\} \qquad (10.22) $$

This variance differs from (10.20) by the two extra terms in the denominator. If
$\rho = 0$, then $\theta^2 = 0$, so that $W' = [X_1', 0]$ and $R_q^2 = R_q^{*2}$. In this case, (10.22) reduces to (10.20).
If $X_q$ also appears in the second equation, or in general is spanned by the variables in $X_2$, then
$e_{tq} = 0$; $\sum_{t=1}^T e_{tq}^2 = 0$ and from (10.22) there is gain in efficiency only if $R_q^2 > R_q^{*2}$. $R_q^2$ is a
measure of multicollinearity of $X_q$ with the other $(K_1 - 1)$ regressors in the first equation, i.e., $X_1$.
If this is high, then it is more likely for $R_q^2 > R_q^{*2}$. Therefore, the higher the multicollinearity
within $X_1$, the greater the potential for a decrease in variance of OLS by SUR. Note that
$R_q^2 = R_q^{*2}$ when $\theta E = 0$. This is true if $\theta = 0$, or $E = 0$. The latter occurs when $X_1$ is
spanned by the sub-space of $X_2$. Problem 7 asks the reader to verify that $R_q^2 = R_q^{*2}$ when $X_1$
is orthogonal to $X_2$. Therefore, with more regressors in each equation, one has to consider the
correlation between the $X$'s within each equation as well as that across equations. Even when
the $X$'s across equations are highly correlated, there may still be gains from joint estimation
using SUR when there is high multicollinearity within each equation.
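
The step from (10.17) to (10.19) can be verified numerically. The sketch below assumes NumPy and uses simulated regressors with an arbitrary, illustrative $\Sigma$. It compares the upper-left block of $(X'\Omega^{-1}X)^{-1}$ with the closed form $\sigma_{11}\{X_1'X_1 + [\rho^2/(1-\rho^2)]E'E\}^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K1, K2 = 40, 3, 2

# Simulated regressors for the two equations (both include an intercept).
X1 = np.column_stack([np.ones(T), rng.normal(size=(T, K1 - 1))])
X2 = np.column_stack([np.ones(T), rng.normal(size=(T, K2 - 1))])

# Illustrative disturbance covariance matrix (an assumption for this sketch).
Sigma = np.array([[2.0, 0.9],
                  [0.9, 1.5]])
s11 = Sigma[0, 0]
rho_sq = Sigma[0, 1] ** 2 / (Sigma[0, 0] * Sigma[1, 1])

# Direct route: upper-left K1 x K1 block of (X' Omega^{-1} X)^{-1}, as in (10.17).
X = np.block([[X1, np.zeros((T, K2))],
              [np.zeros((T, K1)), X2]])
Omega_inv = np.kron(np.linalg.inv(Sigma), np.eye(T))
var_direct = np.linalg.inv(X.T @ Omega_inv @ X)[:K1, :K1]

# Closed form (10.19): E holds the residuals of each column of X1 on X2.
P2 = X2 @ np.linalg.solve(X2.T @ X2, X2.T)
E = X1 - P2 @ X1
theta_sq = rho_sq / (1.0 - rho_sq)
var_formula = s11 * np.linalg.inv(X1.T @ X1 + theta_sq * E.T @ E)

# OLS variance of the first-equation coefficients, for comparison.
var_ols = s11 * np.linalg.inv(X1.T @ X1)

gap = float(np.max(np.abs(var_direct - var_formula)))
print(gap)
print(np.diag(var_formula) / np.diag(var_ols))
```

The two routes agree up to rounding, and the printed ratios of diagonal elements are at most one, reflecting that the extra term $\theta^2 E'E$ can only add to the matrix being inverted.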
10.3 Testing Diagonality of the Variance-Covariance Matrix

Since the diagonality of $\Sigma$ is at the heart of using SUR estimation methods, it is important
to look at tests for $H_0$: $\Sigma$ is diagonal. Breusch and Pagan (1980) derived a simple and easy
to use Lagrange multiplier statistic for testing $H_0$. This is based upon the sample correlation
coefficients of the OLS residuals:

$$ \lambda_{LM} = T \sum_{i=2}^M \sum_{j=1}^{i-1} r_{ij}^2 \qquad (10.23) $$

where $M$ denotes the number of equations and $r_{ij} = \hat s_{ij}/(\hat s_{ii} \hat s_{jj})^{1/2}$. The $\hat s_{ij}$'s are computed
from OLS residuals as in (10.14). Under the null hypothesis, $\lambda_{LM}$ has an asymptotic $\chi^2_{M(M-1)/2}$
distribution. Note that the $\hat s_{ij}$'s are needed for feasible GLS estimation. Therefore, it is easy
to compute the $r_{ij}$'s and $\lambda_{LM}$ by summing the squares of the elements below the diagonal
of $R = [r_{ij}]$, i.e., half of the off-diagonal elements, and multiplying the sum by $T$. For example, for the two equations case,
$\lambda_{LM} = T r_{21}^2$ which is asymptotically distributed as $\chi_1^2$ under $H_0$. For the three equations case,
$\lambda_{LM} = T(r_{21}^2 + r_{31}^2 + r_{32}^2)$ which is asymptotically distributed as $\chi_3^2$ under $H_0$.

Alternatively, the Likelihood Ratio test can also be used to test for diagonality of $\Sigma$. This
is based on the determinants of the variance covariance matrices estimated by MLE for the
restricted and unrestricted models:

$$ \lambda_{LR} = T \left( \sum_{i=1}^M \log \hat s_{ii} - \log |\hat\Sigma| \right) \qquad (10.24) $$

where $\hat s_{ii}$ is the restricted MLE of $\sigma_{ii}$ obtained from the OLS residuals as in (10.14). The matrix
$\hat\Sigma$ denotes the unrestricted MLE of $\Sigma$. This may be adequately approximated with an estimator
based on the feasible GLS estimator $\hat\beta_{FGLS}$, see Judge et al. (1982). Under $H_0$, $\lambda_{LR}$ has an
asymptotic $\chi^2_{M(M-1)/2}$ distribution.
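
Both statistics reduce to a few lines of arithmetic once the covariance estimates are available. The sketch below is in plain Python; the function names are illustrative. The LM example uses the OLS-based estimate of $\Sigma$ for Austria and Belgium reported in Section 10.5 ($T = 19$). The LR function is shown only in general form, since it needs the unrestricted MLE of $\Sigma$ as an input.

```python
import math

def lm_diagonality(S, T):
    """Breusch-Pagan LM statistic (10.23) and its degrees of freedom.

    S is the M x M covariance (or correlation) matrix of OLS residuals.
    """
    M = len(S)
    total = 0.0
    for i in range(1, M):
        for j in range(i):
            total += S[i][j] ** 2 / (S[i][i] * S[j][j])  # squared r_ij
    return T * total, M * (M - 1) // 2

def det(A):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in A]
    n, d = len(A), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[p][c] == 0.0:
            return 0.0
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d

def lr_diagonality(restricted_vars, S_unrestricted, T):
    """LR statistic (10.24): restricted variances versus unrestricted Sigma."""
    M = len(restricted_vars)
    stat = T * (sum(math.log(s) for s in restricted_vars)
                - math.log(det(S_unrestricted)))
    return stat, M * (M - 1) // 2

# OLS-based estimate of Sigma for Austria and Belgium from Section 10.5.
S_ols = [[0.0012128, 0.00023625],
         [0.00023625, 0.00092367]]
lm_stat, lm_df = lm_diagonality(S_ols, T=19)
print(round(lm_stat, 3), lm_df)
```

The LM value agrees with the 0.947 reported in the gasoline example. Because only ratios enter $r_{ij}$, passing a correlation matrix instead of a covariance matrix gives the same statistic.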


10.4  Seemingly  Unrelated  Regressions  with  Unequal  Observations 


Srivastava  and  Dwivedi  (1979)  surveyed  the  developments  in  the  SUR  model  and  described 
the  extensions  of  this  model  to  the  serially  correlated  case,  the  nonlinear  case,  the  misspecified 
case,  and  that  with  unequal  number  of  observations.  Srivastava  and  Giles  (1988)  dedicated  a 
monograph  to  SUR  models,  and  surveyed  the  finite  sample  as  well  as  asymptotic  results.  More 
recently,  Fiebig  (2001)  gives  a  concise  and  up  to  date  account  of  research  in  this  area.  In  this 
section,  we  consider  one  extension  to  focus  upon.  This  is  the  case  of  SUR  with  unequal  number 
of  observations  considered  by  Schmidt  (1977),  Baltagi,  Garvin  and  Kerman  (1989)  and  Hwang 
(1990). 

Let the first firm have $T$ observations common with the second firm, but allow the latter to
have $N$ extra observations. In this case, (10.2) will have $y_1$ of dimension $T \times 1$ whereas $y_2$ will
be of dimension $(T + N) \times 1$. In fact, $y_2' = (y_2^{*\prime}, y_2^{o\prime})$ and $X_2' = (X_2^{*\prime}, X_2^{o\prime})$ with $*$ denoting the
$T$ common observations for the second firm, and $o$ denoting the extra $N$ observations for the
second firm. The disturbances will now have a variance-covariance matrix

$$ \Omega = \begin{bmatrix} \sigma_{11} I_T & \sigma_{12} I_T & 0 \\ \sigma_{12} I_T & \sigma_{22} I_T & 0 \\ 0 & 0 & \sigma_{22} I_N \end{bmatrix} \qquad (10.25) $$

GLS on (10.2) will give

$$ \hat\beta_{GLS} = \begin{bmatrix} \sigma^{11} X_1'X_1 & \sigma^{12} X_1'X_2^* \\ \sigma^{12} X_2^{*\prime}X_1 & \sigma^{22} X_2^{*\prime}X_2^* + (X_2^{o\prime}X_2^o)/\sigma_{22} \end{bmatrix}^{-1} \begin{bmatrix} \sigma^{11} X_1'y_1 + \sigma^{12} X_1'y_2^* \\ \sigma^{12} X_2^{*\prime}y_1 + \sigma^{22} X_2^{*\prime}y_2^* + (X_2^{o\prime}y_2^o)/\sigma_{22} \end{bmatrix} \qquad (10.26) $$

where $\Sigma^{-1} = [\sigma^{ij}]$ for $i, j = 1, 2$. If we run OLS on each equation ($T$ for the first equation, and
$T + N$ for the second equation) and denote the residuals for the two equations by $e_1$ and $e_2$,
respectively, then we can partition the latter residuals into $e_2' = (e_2^{*\prime}, e_2^{o\prime})$. In order to estimate
$\Omega$, Schmidt (1977) considers the following procedures:

(1) Ignore the extra $N$ observations in estimating $\Omega$. In this case

$$ \hat\sigma_{11} = s_{11} = e_1'e_1/T; \quad \hat\sigma_{12} = s_{12} = e_1'e_2^*/T \quad \text{and} \quad \hat\sigma_{22} = s_{22}^* = e_2^{*\prime}e_2^*/T \qquad (10.27) $$

(2) Use $T + N$ observations to estimate $\sigma_{22}$. In other words, use $s_{11}$, $s_{12}$ and $\hat\sigma_{22} = s_{22} =
e_2'e_2/(T + N)$. This procedure is attributed to Wilks (1932) and has the disadvantage of
giving estimates of $\Omega$ that need not be positive definite.

(3) Use $s_{11}$ and $s_{22}$, but modify the estimate of $\sigma_{12}$ such that $\hat\Omega$ is positive definite. Srivastava
and Zaatar (1973) suggest $\hat\sigma_{12} = s_{12}(s_{22}/s_{22}^*)^{1/2}$.

(4) Use all $(T + N)$ observations in estimating $\Omega$. Hocking and Smith (1968) suggest using
$\hat\sigma_{11} = s_{11} - (N/(N + T))(s_{12}/s_{22}^*)^2(s_{22}^* - s_{22}^o)$ where $s_{22}^o = e_2^{o\prime}e_2^o/N$; $\hat\sigma_{12} = s_{12}(s_{22}/s_{22}^*)$ and
$\hat\sigma_{22} = s_{22}$.

(5) Use a maximum likelihood procedure.
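
Procedures (1) to (4) are closed-form adjustments of the same residual moments, so they can be sketched in a few lines of Python. The function name `schmidt_estimates` and the argument names are illustrative. The variance based on the $N$ extra observations is backed out from the identity $e_2'e_2 = e_2^{*\prime}e_2^* + e_2^{o\prime}e_2^o$, a step added here for convenience. The inputs are the values reported for the Austria and Belgium illustration in Section 10.5 ($T = 15$, $N = 4$).

```python
import math

def schmidt_estimates(s11, s12, s22_star, s22, T, N):
    """Return {procedure: (sigma11, sigma12, sigma22)} for Procedures 1-4.

    s11, s12, s22_star are based on the T common observations, as in (10.27);
    s22 is based on all T + N residuals of the second equation.
    """
    # Variance from the N extra observations, using
    # (T + N) * s22 = T * s22_star + N * s22_extra.
    s22_extra = ((T + N) * s22 - T * s22_star) / N
    ratio = s22 / s22_star
    return {
        1: (s11, s12, s22_star),                   # ignore the extra observations
        2: (s11, s12, s22),                        # Wilks (1932)
        3: (s11, s12 * math.sqrt(ratio), s22),     # Srivastava and Zaatar (1973)
        4: (s11 - (N / (N + T)) * (s12 / s22_star) ** 2 * (s22_star - s22_extra),
            s12 * ratio, s22),                     # Hocking and Smith (1968)
    }

def is_positive_definite(s11, s12, s22):
    """Check positive definiteness of a symmetric 2 x 2 matrix."""
    return s11 > 0 and s11 * s22 - s12 ** 2 > 0

est = schmidt_estimates(s11=0.00086791, s12=0.00026357,
                        s22_star=0.00109947, s22=0.00092367, T=15, N=4)
for proc, (a, b, c) in est.items():
    print(proc, a, b, c, is_positive_definite(a, b, c))
```

For these inputs Procedure 3 gives $\hat\sigma_{12} \approx 0.00024158$, matching the value reported in the example, and all four estimated matrices are positive definite. That need not hold in general for Procedure 2.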

All estimators of $\Omega$ are consistent, and $\hat\beta_{FGLS}$ based on any of these estimators will be asymptotically
efficient. Schmidt considers their small sample properties by means of Monte Carlo
experiments. Using the set up of Kmenta and Gilbert (1968) he finds for $T = 10, 20, 50$ and
$N = 5, 10, 20$ and various correlation of the $X$'s and the disturbances across equations the following
disconcerting result: "..it is certainly remarkable that procedures that essentially ignore
the extra observations in estimating $\Sigma$ (e.g., Procedure 1) do not generally do badly relative to
procedures that use the extra observations fully (e.g., Procedure 4 or MLE). Except when the
disturbances are highly correlated across equations, we may as well just forget about the extra
observations in estimating $\Sigma$. This is not an intuitively reasonable procedure."

Hwang (1990) re-parametrizes these estimators in terms of the elements of $\Sigma^{-1}$ rather than $\Sigma$.
After all, it is $\Sigma^{-1}$ rather than $\Sigma$ that appears in the GLS estimator of $\beta$. This re-parametrization
shows that the estimators of $\Sigma^{-1}$ no longer have the ordering in terms of their use of the extra
observations as that reported by Schmidt (1977). However, regardless of the parametrization
chosen, it is important to point out that all the observations are used in the estimation of
$\beta$ whether at the first stage for obtaining the least squares residuals, or in the final stage in
computing GLS. Baltagi et al. (1989) show using Monte Carlo experiments that better estimates
of $\Sigma$ or its inverse $\Sigma^{-1}$ in Mean Square Error sense, do not necessarily lead to better GLS
estimates of $\beta$.


10.5 Empirical Examples


Example 1: Baltagi and Griffin (1983) considered the following gasoline demand equation:

$$ \log\frac{Gas}{Car} = \alpha + \beta_1 \log\frac{Y}{N} + \beta_2 \log\frac{P_{MG}}{P_{GDP}} + \beta_3 \log\frac{Car}{N} + u $$

where $Gas/Car$ is motor gasoline consumption per auto, $Y/N$ is real per capita income,
$P_{MG}/P_{GDP}$ is real motor gasoline price and $Car/N$ denotes the stock of cars per capita.
This data consists of annual observations across 18 OECD countries, covering the period 1960-1978.
It is provided as GASOLINE.DAT on the Springer web site. We consider the first two
countries: Austria and Belgium. OLS on this data yields

$$ \text{Austria:} \quad \log\frac{Gas}{Car} = \underset{(0.373)}{3.727} + \underset{(0.211)}{0.761} \log\frac{Y}{N} - \underset{(0.150)}{0.793} \log\frac{P_{MG}}{P_{GDP}} - \underset{(0.113)}{0.520} \log\frac{Car}{N} $$

$$ \text{Belgium:} \quad \log\frac{Gas}{Car} = \underset{(0.453)}{3.042} + \underset{(0.170)}{0.845} \log\frac{Y}{N} - \underset{(0.158)}{0.042} \log\frac{P_{MG}}{P_{GDP}} - \underset{(0.093)}{0.673} \log\frac{Car}{N} $$

where the standard errors are shown in parentheses. Based on these OLS residuals, the estimate
of $\Sigma$ is given by

$$ \hat\Sigma = \begin{bmatrix} 0.0012128 & 0.00023625 \\ 0.00023625 & 0.00092367 \end{bmatrix} $$

The Seemingly Unrelated Regression estimates based on this $\hat\Sigma$, i.e., after one iteration, are
given by

$$ \text{Austria:} \quad \log\frac{Gas}{Car} = \underset{(0.372)}{3.713} + \underset{(0.209)}{0.721} \log\frac{Y}{N} - \underset{(0.146)}{0.754} \log\frac{P_{MG}}{P_{GDP}} - \underset{(0.111)}{0.496} \log\frac{Car}{N} $$

$$ \text{Belgium:} \quad \log\frac{Gas}{Car} = \underset{(0.445)}{2.843} + \underset{(0.170)}{0.835} \log\frac{Y}{N} - \underset{(0.154)}{0.131} \log\frac{P_{MG}}{P_{GDP}} - \underset{(0.093)}{0.686} \log\frac{Car}{N} $$

The Breusch-Pagan (1980) Lagrange multiplier test for diagonality of $\Sigma$ is $T r_{21}^2 = 0.947$ which is
distributed as $\chi_1^2$ under the null hypothesis. The Likelihood Ratio test for the diagonality of $\Sigma$,
given in (10.24), yields a value of 1.778 which is also distributed as $\chi_1^2$ under the null hypothesis.
Neither test statistic rejects $H_0$. These SUR results were run using SHAZAM and could be
iterated further. Note that the reduction in the standard errors of the estimated regression coefficients
is minor as we compare the OLS and SUR estimates.

Suppose that we only have the first 15 observations (1960-1974) on Austria and all 19 observations
(1960-1978) on Belgium. We now apply the four feasible GLS procedures described
by Schmidt (1977). The first procedure which ignores the extra 4 observations in estimating $\Sigma$
yields $s_{11} = 0.00086791$, $s_{12} = 0.00026357$ and $s_{22}^* = 0.00109947$ as described in (10.27). The
resulting SUR estimates are given by

$$ \text{Austria:} \quad \log\frac{Gas}{Car} = \underset{(0.438)}{4.484} + \underset{(0.168)}{0.817} \log\frac{Y}{N} - \underset{(0.176)}{0.580} \log\frac{P_{MG}}{P_{GDP}} - \underset{(0.098)}{0.487} \log\frac{Car}{N} $$

$$ \text{Belgium:} \quad \log\frac{Gas}{Car} = \underset{(0.436)}{2.936} + \underset{(0.164)}{0.848} \log\frac{Y}{N} - \underset{(0.151)}{0.095} \log\frac{P_{MG}}{P_{GDP}} - \underset{(0.090)}{0.686} \log\frac{Car}{N} $$

The second procedure, due to Wilks (1932), uses the same $s_{11}$ and $s_{12}$ as in procedure 1, but
$\hat\sigma_{22} = s_{22} = e_2'e_2/19 = 0.00092367$. The resulting SUR estimates are given by

$$ \text{Austria:} \quad \log\frac{Gas}{Car} = \underset{(0.437)}{4.521} + \underset{(0.167)}{0.806} \log\frac{Y}{N} - \underset{(0.174)}{0.554} \log\frac{P_{MG}}{P_{GDP}} - \underset{(0.098)}{0.476} \log\frac{Car}{N} $$

$$ \text{Belgium:} \quad \log\frac{Gas}{Car} = \underset{(0.399)}{2.937} + \underset{(0.150)}{0.848} \log\frac{Y}{N} - \underset{(0.138)}{0.094} \log\frac{P_{MG}}{P_{GDP}} - \underset{(0.082)}{0.685} \log\frac{Car}{N} $$

The third procedure, based on Srivastava and Zaatar (1973), uses the same $s_{11}$ and $s_{22}$ as procedure
2, but modifies $\hat\sigma_{12} = s_{12}(s_{22}/s_{22}^*)^{1/2} = 0.00024158$. The resulting SUR estimates are given
by

$$ \text{Austria:} \quad \log\frac{Gas}{Car} = \underset{(0.438)}{4.503} + \underset{(0.168)}{0.812} \log\frac{Y}{N} - \underset{(0.176)}{0.567} \log\frac{P_{MG}}{P_{GDP}} - \underset{(0.098)}{0.481} \log\frac{Car}{N} $$

$$ \text{Belgium:} \quad \log\frac{Gas}{Car} = \underset{(0.400)}{2.946} + \underset{(0.151)}{0.847} \log\frac{Y}{N} - \underset{(0.139)}{0.090} \log\frac{P_{MG}}{P_{GDP}} - \underset{(0.082)}{0.684} \log\frac{Car}{N} $$

The fourth procedure due to Hocking and Smith (1968) yields $\hat\sigma_{11} = 0.00085780$, $\hat\sigma_{12} =
0.00022143$ and $\hat\sigma_{22} = s_{22} = 0.00092367$. The resulting SUR estimates are given by

$$ \text{Austria:} \quad \log\frac{Gas}{Car} = \underset{(0.437)}{4.485} + \underset{(0.168)}{0.817} \log\frac{Y}{N} - \underset{(0.176)}{0.579} \log\frac{P_{MG}}{P_{GDP}} - \underset{(0.098)}{0.487} \log\frac{Car}{N} $$

$$ \text{Belgium:} \quad \log\frac{Gas}{Car} = \underset{(0.400)}{2.952} + \underset{(0.151)}{0.847} \log\frac{Y}{N} - \underset{(0.139)}{0.086} \log\frac{P_{MG}}{P_{GDP}} - \underset{(0.082)}{0.684} \log\frac{Car}{N} $$

In this case, there is not much difference among these four alternative estimates.


Example  2:  Growth  and  Inequality.  Lundberg  and  Squire  (2003)  estimate  a  two  equation  model 
of  growth  and  inequality  using  SUR.  The  first  equation  relates  Growth  (dly)  to  education  (adult 
years  schooling:  yrt),  the  share  of  government  consumption  in  GDP  (gov),  M2/GDP  (m2y), 
Inflation (inf), Sachs-Warner measure of openness (swo), changes in the terms of trade (dtot),
initial  income  (f_pcy),  dummy  for  1980s  (d80)  and  dummy  for  1990s  (d90).  The  second  equation 
relates  the  Gini  coefficient  (gih)  to  education,  M2/GDP,  civil  liberties  index  (civ),  mean  land 
Gini  (mlg),  mean  land  Gini  interacted  with  a  dummy  for  developing  countries  (mlgldc).  The 
data  contains  119  observations  for  38  countries  over  the  period  1965-1990,  and  can  be  obtained 
from http://www.res.org.uk/economic/datasets/datasetlist.asp.

Table 10.1  Growth and Inequality: SUR Estimates

. sureg (Growth: dly = yrt gov m2y inf swo dtot f_pcy d80 d90)
        (Inequality: gih = yrt m2y civ mlg mlgldc), corr

Seemingly unrelated regression

Equation       Obs   Parms       RMSE    "R-sq"      chi2        P
Growth         119       9   2.313764    0.4047     80.36   0.0000
Inequality     119       5   6.878804    0.4612    102.58   0.0000

                  Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
Growth
  yrt         -.0497042    .1546178   -0.32    0.748    -.3527496    .2533412
  gov         -.0345058    .0354801   -0.97    0.331    -.1040455    .0350338
  m2y          .0084999    .0163819    0.52    0.604     -.023608    .0406078
  inf         -.0020648    .0013269   -1.56    0.120    -.0046655     .000536
  swo          3.263209      .60405    5.40    0.000     2.079292    4.447125
  dtot         17.74543     21.9798    0.81    0.419    -25.33419    60.82505
  f_pcy       -1.038173    .4884378   -2.13    0.034    -1.995494   -.0808529
  d80         -1.615472    .5090782   -3.17    0.002    -2.613247   -.6176976
  d90         -3.339514    .6063639   -5.51    0.000    -4.527965   -2.151063
  _cons        10.60415    3.471089    3.05    0.002     3.800944    17.40736
Inequality
  yrt         -1.000843    .3696902   -2.71    0.007    -1.725422   -.2762635
  m2y         -.0570365    .0471514   -1.21    0.226    -.1494516    .0353785
  civ          .0348434    .5533733    0.06    0.950    -1.049748    1.119435
  mlg          .1684692    .0625023    2.70    0.007     .0459669    .2909715
  mlgldc       .0344093    .0421904    0.82    0.415    -.0482823     .117101
  _cons        33.96115    4.471626    7.59    0.000     25.19693    42.72538

Correlation matrix of residuals
               Growth   Inequality
Growth         1.0000
Inequality     0.0872       1.0000

Breusch-Pagan test of independence: chi2(1) = 0.905, Pr = 0.3415.
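
The Breusch-Pagan statistic reported at the bottom of Table 10.1 can be checked by hand from the residual correlation. The following is a minimal Python sketch, assuming the usual LM form (number of observations times the sum of squared off-diagonal residual correlations); the function names are illustrative and are not part of Stata or of the original study. Because the correlation is taken from the table rounded to four decimals, the result agrees with the reported values only up to rounding.

```python
import math

def breusch_pagan_lm(residual_corrs, n_obs):
    """LM statistic: n_obs times the sum of squared off-diagonal residual correlations."""
    return n_obs * sum(r * r for r in residual_corrs)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variate with 1 degree of freedom.

    For one degree of freedom, P(Z^2 > x) = erfc(sqrt(x / 2)) with Z standard normal.
    """
    return math.erfc(math.sqrt(x / 2.0))

# Two equations give a single off-diagonal correlation, hence 1 degree of freedom.
lm = breusch_pagan_lm([0.0872], n_obs=119)
p_value = chi2_sf_1df(lm)
print(f"LM = {lm:.3f}, p-value = {p_value:.4f}")
```

With more than two equations the degrees of freedom equal the number of distinct off-diagonal correlations, so the one-degree-of-freedom tail function above would need to be replaced.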

Table 10.1 gives the SUR estimates reported in Table 1 of Lundberg and Squire (2003, p.
334) using the sureg command in Stata. Among other things, these results show that openness
enhances growth and education reduces inequality. The correlation between the residuals of the
two equations is weak (0.0872), and the Breusch-Pagan test for diagonality of the variance-covariance
matrix of the disturbances of the two equations is statistically insignificant, so we do not
reject zero correlation between the disturbances of the two equations.

Problems

1.  When  Is  OLS  as  Efficient  as  Zellner’s  SUR? 

(a)  Show  that  OLS  on  a  system  of  two  Zellner’s  SUR  equations  given  in  (10.2)  is  the  same  as  OLS 
on  each  equation  taken  separately.  What  about  the  estimated  variance-covariance  matrix  of 
the  coefficients?  Will  they  be  the  same? 

(b) In the General Linear Model, we found that a necessary and sufficient condition for OLS to be
equivalent to GLS is that $X'\Omega^{-1}\bar{P}_X = 0$ for every $y$, where $\bar{P}_X = I - P_X$. Show that a necessary
and sufficient condition for Zellner's GLS to be equivalent to OLS is that $\sigma^{ij}X_i'\bar{P}_{X_j} = 0$
for $i \neq j$ as described in (10.10). This is based on Baltagi (1988).

(c) Show that the two sufficient conditions given by Zellner for SUR to be equivalent to OLS
both satisfy the necessary and sufficient condition given in part (b).

(d) Show that if $X_i = X_jC'$ where $C$ is an arbitrary nonsingular matrix, then the necessary and
sufficient condition given in part (b) is satisfied.

2. What Happens to Zellner's SUR Estimator when the Set of Regressors in One Equation Are a
Subset of Those in the Second Equation? Consider the two SUR equations given in (10.2). Let
$X_1 = (X_2, X_e)$, i.e., $X_2$ is a subset of $X_1$. Prove that

(a) $\hat{\beta}_{2,SUR} = \hat{\beta}_{2,OLS}$.

(b) $\hat{\beta}_{1,SUR} = \hat{\beta}_{1,OLS} - Ae_{2,OLS}$, where $A = \hat{s}_{12}(X_1'X_1)^{-1}X_1'/\hat{s}_{22}$. $e_{2,OLS}$ are the OLS residuals
from the second equation, and the $\hat{s}_{ij}$'s are defined in (10.14).

3. What Happens to Zellner's SUR Estimator when the Set of Regressors in One Equation Are Orthogonal
to Those in the Second Equation? Consider the two SUR equations given in (10.2). Let
$X_1$ and $X_2$ be orthogonal, i.e., $X_1'X_2 = 0$. Show that knowing the true $\Sigma$ we get

(a) $\hat{\beta}_{1,GLS} = \hat{\beta}_{1,OLS} + (\sigma^{12}/\sigma^{11})(X_1'X_1)^{-1}X_1'y_2$ and $\hat{\beta}_{2,GLS} = \hat{\beta}_{2,OLS} + (\sigma^{21}/\sigma^{22})(X_2'X_2)^{-1}X_2'y_1$.

(b) What are the variances of these estimates?

(c) If $X_1$ and $X_2$ are single regressors, what are the relative efficiencies of $\hat{\beta}_{i,OLS}$ with respect
to $\hat{\beta}_{i,GLS}$ for $i = 1, 2$?

4. An Unbiased Estimate of $\sigma_{ij}$. Verify that $\hat{s}_{ij}$, given in (10.13), is unbiased for $\sigma_{ij}$. Note that for
computational purposes $\text{tr}(B) = \text{tr}(P_{X_i}P_{X_j})$.

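5. Relative Efficiency of OLS in the Case of Simple Regressions. This is based on Kmenta (1986, pp.
641-643). For the system of two equations given in (10.15), show that

(a) $\text{var}(\hat{\beta}_{12,OLS}) = \sigma_{11}/m_{x_1x_1}$ and $\text{var}(\hat{\beta}_{22,OLS}) = \sigma_{22}/m_{x_2x_2}$, where
$m_{x_ix_j} = \sum_{t=1}^{T}(X_{it} - \bar{X}_i)(X_{jt} - \bar{X}_j)$ for $i, j = 1, 2$.

(b) $$
\text{var}\begin{pmatrix} \hat{\beta}_{12,GLS} \\ \hat{\beta}_{22,GLS} \end{pmatrix}
= (\sigma_{11}\sigma_{22} - \sigma_{12}^2)
\begin{bmatrix} \sigma_{22}m_{x_1x_1} & -\sigma_{12}m_{x_1x_2} \\ -\sigma_{12}m_{x_1x_2} & \sigma_{11}m_{x_2x_2} \end{bmatrix}^{-1}
$$

Deduce that $\text{var}(\hat{\beta}_{12,GLS}) = (\sigma_{11}\sigma_{22} - \sigma_{12}^2)\sigma_{11}m_{x_2x_2}/[\sigma_{11}\sigma_{22}m_{x_1x_1}m_{x_2x_2} - \sigma_{12}^2m_{x_1x_2}^2]$ and
$\text{var}(\hat{\beta}_{22,GLS}) = (\sigma_{11}\sigma_{22} - \sigma_{12}^2)\sigma_{22}m_{x_1x_1}/[\sigma_{11}\sigma_{22}m_{x_1x_1}m_{x_2x_2} - \sigma_{12}^2m_{x_1x_2}^2]$.

(c) Using $\rho = \sigma_{12}/(\sigma_{11}\sigma_{22})^{1/2}$ and $r = m_{x_1x_2}/(m_{x_1x_1}m_{x_2x_2})^{1/2}$ and the results in parts (a) and
(b), show that (10.16) holds, i.e., $\text{var}(\hat{\beta}_{12,GLS})/\text{var}(\hat{\beta}_{12,OLS}) = (1 - \rho^2)/[1 - \rho^2r^2]$.

(d) Differentiate (10.16) with respect to $\theta = \rho^2$ and show that (10.16) is a non-increasing function
of $\theta$. Similarly, differentiate (10.16) with respect to $\lambda = r^2$ and show that (10.16) is a
non-decreasing function of $\lambda$. Finally, compute this efficiency measure (10.16) for various values
of $\rho^2$ and $r^2$ between 0 and 1 at 0.1 intervals, see Kmenta's (1986) Table 12-1, p. 642.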

6. Relative Efficiency of OLS in the Case of Multiple Regressions. This is based on Binkley and
Nelson (1988). Using partitioned inverse formulas, verify that $\text{var}(\hat{\beta}_{1,GLS}) = A^{11}$ given below
(10.17). Deduce (10.18) and (10.19).

7. Consider the multiple regression case with orthogonal regressors across the two equations, i.e.,
$X_1'X_2 = 0$. Verify that $R_q^2 = R_q^{*2}$, where $R_q^2$ and $R_q^{*2}$ are defined below (10.20) and (10.21),
respectively.

8. (a) SUR With Unequal Number of Observations. This is based on Schmidt (1977). Derive the
GLS estimator for SUR with unequal number of observations given in (10.26).

(b) Show that if $\sigma_{12} = 0$, SUR with unequal number of observations reduces to OLS on each
equation separately.

9. Grunfeld (1958) considered the following investment equation:

$$I_{it} = \alpha + \beta_1F_{it} + \beta_2C_{it} + u_{it}$$

where $I_{it}$ denotes real gross investment for firm $i$ in year $t$, $F_{it}$ is the real value of the firm (shares
outstanding) and $C_{it}$ is the real value of the capital stock. This data set consists of 10 large U.S.
manufacturing firms over 20 years, 1935-1954, and is given in Boot and de Witt (1960). It is
provided as GRUNFELD.DAT on the Springer web site. Consider the first three firms: G.M., U.S.
Steel and General Electric.


(a)  Run  OLS  of  I  on  a  constant,  F  and  C  for  each  of  the  3  firms  separately.  Plot  the  residuals 
against  time.  Print  the  variance-covariance  of  the  estimates. 

(b)  Test  for  serial  correlation  in  each  regression. 

(c)  Run  Seemingly  Unrelated  Regressions  (SUR)  for  the  first  two  firms.  Compare  with  OLS. 

(d)  Run  SUR  for  the  three  assigned  firms.  Compare  these  results  with  those  in  part  (c). 

(e) Test for the diagonality of $\Sigma$ across these three equations.

(f)  Test  for  the  equality  of  all  coefficients  across  the  3  firms. 

10. (Continue Problem 9). Consider the first two firms again and focus on the coefficient of F. Refer
to the Binkley and Nelson (1988) article in The American Statistician, and compute $R_q^2$, $R_q^{*2}$, $\sum e_q^2$
and $\sum x_q^2$.

(a) What would be equations (10.20) and (10.21) for your data set?

(b) Substitute estimates of $\sigma_{11}$ and $\theta^2$ and verify that the results are the same as those obtained
in problems 9(a) and 9(c).

(c)  Compare  the  results  from  equations  (10.20)  and  (10.21)  in  part  (a).  What  do  you  conclude? 

11.  (Continue  Problem  9).  Consider  the  first  two  firms  once  more.  Now  you  only  have  the  first  15 
observations  on  the  first  firm  and  all  20  observations  on  the  second  firm.  Apply  Schmidt’s  (1977) 
feasible  GLS  estimators  and  compare  the  resulting  estimates. 


12. For the Baltagi and Griffin (1983) Gasoline Data considered in section 10.5, the model is

$$\log\frac{Gas}{Car} = \alpha + \beta_1\log\frac{Y}{N} + \beta_2\log\frac{Pmg}{Pgdp} + \beta_3\log\frac{Car}{N} + u$$

where Gas/Car is motor gasoline consumption per auto, Y/N is real per capita income, Pmg/Pgdp
is real motor gasoline price and Car/N denotes the stock of cars per capita.

(a) Run Seemingly Unrelated Regressions (SUR) for the first two countries. Compare with OLS.

(b) Run SUR for the first three countries. Comment on the results and compare with those of
part (a). (Are there gains in efficiency?)

(c) Test for diagonality of $\Sigma$ across the three equations using the Breusch and Pagan (1980) LM
test and the Likelihood Ratio test.

(d)  Test  for  the  equality  of  all  coefficients  across  the  3  countries. 

(e)  Consider  the  first  2  countries  once  more.  Now  you  only  have  the  first  15  observations  on  the 
first  country  and  all  19  observations  on  the  second  country.  Apply  Schmidt’s  (1977)  feasible 
GLS  estimators,  and  compare  the  results. 

13. Trace Minimization of Singular Systems with Cross-Equation Restrictions. This is based on Baltagi
(1993). Berndt and Savin (1975) demonstrated that when certain cross-equation restrictions are
imposed, restricted least squares estimation of a singular set of SUR equations will not be invariant
to which equation is deleted. Consider the following set of three equations with the same regressors:

$$y_i = \alpha_i\iota_T + \beta_iX + \epsilon_i \qquad i = 1, 2, 3$$

where $y_i = (y_{i1}, y_{i2}, \ldots, y_{iT})'$, $X = (x_1, x_2, \ldots, x_T)'$, and $\epsilon_i$ for $i = 1, 2, 3$ are $T \times 1$ vectors
and $\iota_T$ is a vector of ones of dimension $T$. $\alpha_i$ and $\beta_i$ are scalars, and these equations satisfy the
adding up restriction $\sum_{i=1}^{3}y_{it} = 1$ for every $t = 1, 2, \ldots, T$. Additionally, we have a cross-equation
restriction: $\beta_1 = \beta_2$.

(a) Denote the unrestricted OLS estimates of $\beta_i$ by $b_i$ where $b_i = \sum_{t=1}^{T}(x_t - \bar{x})y_{it}/\sum_{t=1}^{T}(x_t - \bar{x})^2$
for $i = 1, 2, 3$, and $\bar{x} = \sum_{t=1}^{T}x_t/T$. Show that these unrestricted $b_i$'s satisfy the adding up
restriction $\beta_1 + \beta_2 + \beta_3 = 0$ on the true parameters automatically.

(b) Show that if one drops the first equation for $i = 1$ and estimates the remaining system by
trace minimization subject to $\beta_1 = \beta_2$, one gets $\hat{\beta}_1 = 0.4b_1 + 0.6b_2$.

(c) Now drop the second equation for $i = 2$, and show that estimating the remaining system by
trace minimization subject to $\beta_1 = \beta_2$ gives $\hat{\beta}_1 = 0.6b_1 + 0.4b_2$.

(d) Finally, drop the third equation for $i = 3$, and show that estimating the remaining system
by trace minimization subject to $\beta_1 = \beta_2$ gives $\hat{\beta}_1 = 0.5b_1 + 0.5b_2$.

Note that this also means the variance of $\hat{\beta}_1$ is not invariant to the deleted equation. Also, this
non-invariancy affects Zellner's SUR estimation if the restricted least squares residuals are used
rather than the unrestricted least squares residuals in estimating the variance-covariance matrix
of the disturbances. Hint: See the solution by Im (1994).

14. For the Natural Gas data considered in Chapter 4, problem 16, the model is

$$\log Cons_{it} = \beta_0 + \beta_1\log Pg_{it} + \beta_2\log Po_{it} + \beta_3\log Pe_{it} + \beta_4\log HDD_{it} + \beta_5\log PI_{it} + u_{it}$$

where $i = 1, 2, \ldots, 6$ states and $t = 1, 2, \ldots, 23$ years.

(a)  Run  Seemingly  Unrelated  Regressions  (SUR)  for  the  first  two  states.  Compare  with  OLS. 

(b)  Run  SUR  for  all  six  states.  Comment  on  the  results  and  compare  with  those  of  part  (a).  (Are 
there  gains  in  efficiency?) 

(c) Test for diagonality of $\Sigma$ across the six states using the Breusch and Pagan (1980) LM test
and the Likelihood Ratio test.

(d) Test for the equality of all coefficients across the six states.

15. Equivalence of LR Test and Hausman Test. This is based on Qian (1998). Suppose that we have
the following two equations:

$$y_{gt} = \alpha_g + u_{gt} \qquad g = 1, 2, \quad t = 1, 2, \ldots, T$$

where $(u_{1t}, u_{2t})$ is normally distributed with mean zero and variance $\Omega = \Sigma \otimes I_T$ where $\Sigma = [\sigma_{gs}]$
for $g, s = 1, 2$. This is a simple example of the same regressors across two equations.

(a) Show that the OLS estimator of $\alpha_g$ is the same as the GLS estimator of $\alpha_g$ and both are
equal to $\bar{y}_g = \sum_{t=1}^{T}y_{gt}/T$ for $g = 1, 2$.

(b) Derive the maximum likelihood estimators of $\alpha_g$ and $\sigma_{gs}$ for $g, s = 1, 2$. Compute the
log-likelihood function evaluated at these unrestricted estimates.

(c) Compute the maximum likelihood estimators of $\alpha_g$ and $\sigma_{gs}$ for $g, s = 1, 2$ under the null
hypothesis $H_0$; $\sigma_{11} = \sigma_{22}$.

(d) Using parts (b) and (c) compute the LR test for $H_0$; $\sigma_{11} = \sigma_{22}$.

(e) Show that the LR test for $H_0$ derived in part (d) is asymptotically equivalent to the Hausman
test based on the difference in estimators obtained in parts (b) and (c). Hausman's test is
studied in Chapter 12.

16.  Estimation  of  a  Triangular,  Seemingly  Unrelated  Regression  System  by  OLS.  This  is  based  on 
Sentana  (1997).  Consider  a  system  of  three  SUR  equations  in  which  the  explanatory  variables  for 
the  first  equation  are  a  subset  of  the  explanatory  variables  for  the  second  equation,  which  are  in 
turn  a  subset  of  the  explanatory  variables  for  the  third  equation. 

(a)  Show  that  SUR  applied  to  the  first  two  equations  is  the  same  (for  those  equations)  as  SUR 
applied  to  all  three  equations.  Hint:  See  Schmidt  (1978). 

(b)  Using  part  (a)  show  that  SUR  for  the  first  equation  is  equivalent  to  OLS. 

(c)  Using  parts  (a)  and  (b)  show  that  SUR  for  the  second  equation  is  equivalent  to  OLS  on  the 
second  equation  with  one  additional  regressor.  The  extra  regressor  is  the  OLS  residuals  from 
the  first  equation.  Hint:  Use  Telser’s  (1964)  results. 

(d)  Using  parts  (a),  (b)  and  (c)  show  that  SUR  for  the  third  equation  is  equivalent  to  OLS  on  the 
third  equation  with  the  residuals  from  the  regressions  in  parts  (b)  and  (c)  as  extra  regressors. 

17. Growth and Inequality. Lundberg and Squire (2003). See example 2, section 10.5. The data contain
119 observations for 38 countries over the period 1965-1990, and can be obtained from
http://www.res.org.uk/economic/datasets/datasetlist.asp.

(a)  Estimate  these  equations  using  SUR,  see  Table  10.1,  and  verify  the  results  reported  in  Table 
1  of  Lundberg  and  Squire  (2003,  p.  334).  These  results  show  that  openness  enhances  growth 
and  education  reduces  inequality. 

(b) Report the Breusch-Pagan test for diagonality of the variance-covariance matrix of the
disturbances of the two equations. Compare the SUR estimates in part (a) to OLS on each
equation separately.

CHAPTER 11

Simultaneous  Equations  Model 

11.1  Introduction 

Economists formulate models for consumption, production, investment, money demand and
money supply, labor demand and labor supply to attempt to explain the workings of the economy.
These behavioral equations are estimated equation by equation or jointly as a system
of equations. These are known as simultaneous equations models. Much of today's econometrics
has been influenced and shaped by a group of economists and econometricians known as
the Cowles Commission who worked together at the University of Chicago in the late 1940s,
see Chapter 1. Simultaneous equations models had their genesis in economics during that period.
Haavelmo's (1944) work emphasized the use of the probability approach to formulating
econometric models. Koopmans and Marschak (1950) and Koopmans and Hood (1953) in two
influential Cowles Commission monographs provided the appropriate statistical procedures for
handling simultaneous equations models. In this chapter, we first give simple examples of simultaneous
equations models and show why the least squares estimator is no longer appropriate.
Next, we discuss the important problem of identification and give a simple necessary but not
sufficient condition that helps check whether a specific equation is identified. Sections 11.2 and
11.3 give the estimation of a single equation and a system of equations using instrumental variable
procedures. Section 11.4 gives a test of over-identification restrictions, whereas section 11.5 gives a
Hausman specification test. Section 11.6 concludes with an empirical example. The Appendix
revisits the identification problem and gives a necessary and sufficient condition for identification.

11.1.1  Simultaneous  Bias 

Example 1: Consider a simple Keynesian model with no government

$$C_t = \alpha + \beta Y_t + u_t \qquad t = 1, 2, \ldots, T \qquad (11.1)$$

$$Y_t = C_t + I_t \qquad (11.2)$$

where $C_t$ denotes consumption, $Y_t$ denotes disposable income, and $I_t$ denotes autonomous investment.
This is a system of two simultaneous equations, also known as structural equations,
with the second equation being an identity. The first equation can be estimated by OLS giving

$$\hat{\beta}_{OLS} = \sum_{t=1}^{T}y_tc_t\Big/\sum_{t=1}^{T}y_t^2 \quad \text{and} \quad \hat{\alpha}_{OLS} = \bar{C} - \hat{\beta}_{OLS}\bar{Y} \qquad (11.3)$$

with $y_t$ and $c_t$ denoting $Y_t$ and $C_t$ in deviation form, i.e., $y_t = Y_t - \bar{Y}$, and $\bar{Y} = \sum_{t=1}^{T}Y_t/T$.
Since $I_t$ is autonomous, it is an exogenous variable determined outside the system, whereas $C_t$
and $Y_t$ are endogenous variables determined by the system. Let us solve for $Y_t$ and $C_t$ in terms
of the constant and $I_t$. The resulting two equations are known as the reduced form equations

$$C_t = \alpha/(1 - \beta) + \beta I_t/(1 - \beta) + u_t/(1 - \beta) \qquad (11.4)$$

$$Y_t = \alpha/(1 - \beta) + I_t/(1 - \beta) + u_t/(1 - \beta) \qquad (11.5)$$

These equations express each endogenous variable in terms of exogenous variables and the error
terms. Note that both $Y_t$ and $C_t$ are a function of $u_t$, and hence both are correlated with $u_t$. In
fact, $Y_t - E(Y_t) = u_t/(1 - \beta)$, and

$$\text{cov}(Y_t, u_t) = E[(Y_t - E(Y_t))u_t] = \sigma_u^2/(1 - \beta) > 0 \quad \text{if } 0 < \beta < 1 \qquad (11.6)$$

This holds because $u_t \sim (0, \sigma_u^2)$ and $I_t$ is exogenous and independent of the error term. Equation
(11.6) shows that the right hand side regressor in (11.1) is correlated with the error term. This
causes the OLS estimates to be biased and inconsistent. In fact, from (11.1),

$$c_t = C_t - \bar{C} = \beta y_t + (u_t - \bar{u})$$

and substituting this expression in (11.3), we get

$$\hat{\beta}_{OLS} = \beta + \sum_{t=1}^{T}y_tu_t\Big/\sum_{t=1}^{T}y_t^2 \qquad (11.7)$$

From (11.7), it is clear that $E(\hat{\beta}_{OLS}) \neq \beta$, since the expected value of the second term is not
necessarily zero. Also, using (11.5) one gets

$$y_t = Y_t - \bar{Y} = [i_t + (u_t - \bar{u})]/(1 - \beta)$$

where $i_t = I_t - \bar{I}$ and $\bar{I} = \sum_{t=1}^{T}I_t/T$. Defining $m_{yy} = \sum_{t=1}^{T}y_t^2/T$, we get

$$m_{yy} = (m_{ii} + 2m_{iu} + m_{uu})/(1 - \beta)^2 \qquad (11.8)$$

where $m_{ii} = \sum_{t=1}^{T}i_t^2/T$, $m_{iu} = \sum_{t=1}^{T}i_t(u_t - \bar{u})/T$ and $m_{uu} = \sum_{t=1}^{T}(u_t - \bar{u})^2/T$. Also,

$$m_{yu} = (m_{iu} + m_{uu})/(1 - \beta) \qquad (11.9)$$

Using the fact that plim $m_{iu} = 0$ and plim $m_{uu} = \sigma_u^2$, we get

$$\text{plim}\,\hat{\beta}_{OLS} = \beta + \text{plim}(m_{yu}/m_{yy}) = \beta + [\sigma_u^2(1 - \beta)/(\text{plim}\,m_{ii} + \sigma_u^2)]$$

which shows that $\hat{\beta}_{OLS}$ overstates $\beta$ if $0 < \beta < 1$.
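
The probability limit above can be illustrated with a small simulation. The sketch below uses only the Python standard library; the parameter values (α = 5, β = 0.6, σ_u = 1, and investment drawn as a normal variate with mean 10 and standard deviation 2) and all function names are assumptions chosen for illustration, not values from the text. With these choices plim m_ii = 4, so the formula predicts a limit of 0.6 + 1(0.4)/(4 + 1) = 0.68.

```python
import random

def simulate_keynesian(T, alpha, beta, sigma_u, mean_I, sd_I, seed=0):
    """Draw (C_t, Y_t) from the reduced form (11.5) and the consumption function (11.1)."""
    rng = random.Random(seed)
    C, Y = [], []
    for _ in range(T):
        inv = rng.gauss(mean_I, sd_I)            # exogenous autonomous investment I_t
        u = rng.gauss(0.0, sigma_u)              # structural disturbance u_t
        y = (alpha + inv + u) / (1.0 - beta)     # reduced form for income, (11.5)
        c = alpha + beta * y + u                 # structural consumption function, (11.1)
        C.append(c)
        Y.append(y)
    return C, Y

def ols_slope(dep, reg):
    """OLS slope in deviation form, as in (11.3)."""
    n = len(reg)
    reg_bar = sum(reg) / n
    dep_bar = sum(dep) / n
    sxy = sum((r - reg_bar) * (d - dep_bar) for r, d in zip(reg, dep))
    sxx = sum((r - reg_bar) ** 2 for r in reg)
    return sxy / sxx

def plim_beta_ols(beta, sigma_u, var_I):
    """Probability limit of the OLS slope: beta + sigma_u^2 (1 - beta) / (var_I + sigma_u^2)."""
    return beta + sigma_u ** 2 * (1.0 - beta) / (var_I + sigma_u ** 2)

beta = 0.6
C, Y = simulate_keynesian(100_000, alpha=5.0, beta=beta, sigma_u=1.0, mean_I=10.0, sd_I=2.0)
b_ols = ols_slope(C, Y)
b_lim = plim_beta_ols(beta, sigma_u=1.0, var_I=4.0)
print(f"true beta = {beta}, OLS estimate = {b_ols:.3f}, plim from formula = {b_lim:.3f}")
```

In a large sample the OLS estimate settles near the formula's limit rather than near the true β, which is the inconsistency described above.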

Example 2: Consider a simple demand and supply model

$$Q_t^d = \alpha + \beta P_t + u_{1t} \qquad (11.10)$$

$$Q_t^s = \gamma + \delta P_t + u_{2t} \qquad (11.11)$$

$$Q_t^d = Q_t^s = Q_t \qquad t = 1, 2, \ldots, T \qquad (11.12)$$

Substituting the equilibrium condition (11.12) in (11.10) and (11.11), we get

$$Q_t = \alpha + \beta P_t + u_{1t} \qquad (11.13)$$

$$Q_t = \gamma + \delta P_t + u_{2t} \qquad t = 1, 2, \ldots, T \qquad (11.14)$$


For the demand equation (11.13), the sign of $\beta$ is expected to be negative, while for the supply
equation (11.14), the sign of $\delta$ is expected to be positive. However, we only observe one equilibrium
pair $(Q_t, P_t)$ and these are not labeled demand or supply quantities and prices. When we
run the OLS regression of $Q_t$ on $P_t$ we do not know what we are estimating, demand or supply.
In fact, any linear combination of (11.13) and (11.14) looks exactly like (11.13) or (11.14). It
will have a constant, price, and a disturbance term in it. Since demand or supply cannot be
distinguished from this 'mongrel' we have what is known as an identification problem. If the
demand equation (or the supply equation) looked different from this mongrel, then this particular
equation would be identified. More on this later. For now let us examine the properties of
the OLS estimates of the demand equation. It is well known that

$$\hat{\beta}_{OLS} = \sum_{t=1}^{T}q_tp_t\Big/\sum_{t=1}^{T}p_t^2 = \beta + \sum_{t=1}^{T}p_t(u_{1t} - \bar{u}_1)\Big/\sum_{t=1}^{T}p_t^2 \qquad (11.15)$$

where $q_t$ and $p_t$ denote $Q_t$ and $P_t$ in deviation form, i.e., $q_t = Q_t - \bar{Q}$. This estimator is
unbiased only if the last term in (11.15) has zero expectation. In order to find
this expectation we solve the structural equations in (11.13) and (11.14) for $Q_t$ and $P_t$

$$Q_t = (\alpha\delta - \gamma\beta)/(\delta - \beta) + (\delta u_{1t} - \beta u_{2t})/(\delta - \beta) \qquad (11.16)$$

$$P_t = (\alpha - \gamma)/(\delta - \beta) + (u_{1t} - u_{2t})/(\delta - \beta) \qquad (11.17)$$

(11.16) and (11.17) are known as the reduced form equations. Note that both $Q_t$ and $P_t$ are
functions of both errors $u_1$ and $u_2$. Hence, $P_t$ is correlated with $u_{1t}$. In fact,

$$p_t = (u_{1t} - \bar{u}_1)/(\delta - \beta) - (u_{2t} - \bar{u}_2)/(\delta - \beta) \qquad (11.18)$$

and

$$\text{plim}\sum_{t=1}^{T}p_t(u_{1t} - \bar{u}_1)/T = (\sigma_{11} - \sigma_{12})/(\delta - \beta) \qquad (11.19)$$

$$\text{plim}\sum_{t=1}^{T}p_t^2/T = (\sigma_{11} + \sigma_{22} - 2\sigma_{12})/(\delta - \beta)^2 \qquad (11.20)$$

where $\sigma_{ij} = \text{cov}(u_{it}, u_{jt})$ for $i, j = 1, 2$; and $t = 1, \ldots, T$. Hence, from (11.15)

$$\text{plim}\,\hat{\beta}_{OLS} = \beta + (\sigma_{11} - \sigma_{12})(\delta - \beta)/(\sigma_{11} + \sigma_{22} - 2\sigma_{12}) \qquad (11.21)$$

and the last term is not necessarily zero, implying that $\hat{\beta}_{OLS}$ is not consistent for $\beta$. Similarly,
one can show that the OLS estimator for $\delta$ is not consistent, see problem 1. This simultaneous
bias is once again due to the correlation of the right hand side variable (price) with the error
term $u_{1t}$. This correlation could be due to the fact that $P_t$ is a function of $u_{2t}$, from (11.17),
and $u_{2t}$ and $u_{1t}$ are correlated, making $P_t$ correlated with $u_{1t}$. Alternatively, $P_t$ is a function of
$Q_t$, from (11.13) or (11.14), and $Q_t$ is a function of $u_{1t}$, from (11.13), making $P_t$ a function of
$u_{1t}$. Intuitively, if a shock in demand (i.e., a change in $u_{1t}$) shifts the demand curve, the new
intersection of demand and supply determines a new equilibrium price and quantity. This new
price is therefore affected by the change in $u_{1t}$, and is correlated with it.
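
Equation (11.21) can be checked numerically. The following Python sketch (standard library only) simulates equilibrium prices and quantities from the reduced form and regresses quantity on price. The parameter values (α = 10, β = -1, γ = 2, δ = 1, σ11 = 1, σ22 = 4, σ12 = 0) and the function names are hypothetical choices made for this illustration. For these values (11.21) gives -1 + (1)(2)/(5) = -0.6, so OLS recovers neither the demand slope nor the supply slope.

```python
import random

def simulate_market(T, alpha, beta, gamma, delta, sd1, sd2, seed=0):
    """Draw equilibrium (Q_t, P_t) with independent demand and supply shocks."""
    rng = random.Random(seed)
    Q, P = [], []
    for _ in range(T):
        u1 = rng.gauss(0.0, sd1)                           # demand shock u_1t
        u2 = rng.gauss(0.0, sd2)                           # supply shock u_2t
        p = (alpha - gamma + u1 - u2) / (delta - beta)     # reduced form price, (11.17)
        q = alpha + beta * p + u1                          # demand equation, (11.13)
        Q.append(q)
        P.append(p)
    return Q, P

def ols_slope(dep, reg):
    """OLS slope in deviation form, as in (11.15)."""
    n = len(reg)
    reg_bar = sum(reg) / n
    dep_bar = sum(dep) / n
    sxy = sum((r - reg_bar) * (d - dep_bar) for r, d in zip(reg, dep))
    sxx = sum((r - reg_bar) ** 2 for r in reg)
    return sxy / sxx

def plim_beta_ols(beta, delta, s11, s22, s12):
    """Probability limit of the OLS slope from (11.21)."""
    return beta + (s11 - s12) * (delta - beta) / (s11 + s22 - 2.0 * s12)

beta, delta = -1.0, 1.0
Q, P = simulate_market(100_000, alpha=10.0, beta=beta, gamma=2.0, delta=delta, sd1=1.0, sd2=2.0)
b_ols = ols_slope(Q, P)
b_lim = plim_beta_ols(beta, delta, s11=1.0, s22=4.0, s12=0.0)
print(f"demand slope = {beta}, supply slope = {delta}, OLS = {b_ols:.3f}, (11.21) = {b_lim:.3f}")
```

Raising the supply shock variance σ22 relative to σ11 moves the limit toward the demand slope, since supply shifts then trace out the demand curve.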

In general, whenever a right hand side variable is correlated with the error term, the OLS
estimates are biased and inconsistent. We refer to this as an endogeneity problem. Recall Figure
3 of Chapter 3 with cov$(P_t, u_{1t}) > 0$. This shows that $P_t$'s above their mean are on the average
associated with $u_{1t}$'s above their mean (i.e., $u_{1t} > 0$). This implies that the quantity $Q_t$ associated
with this particular $P_t$ is on the average above the true line $(\alpha + \beta P_t)$. This is true for
all observations to the right of $E(P_t)$. Similarly, any $P_t$ to the left of $E(P_t)$ is on the average
associated with a $u_{1t}$ below its mean (i.e., $u_{1t} < 0$). This implies that quantities associated with
prices below their mean $E(P_t)$ are on the average data points that lie below the true line. With
this observed data, the estimated line using OLS will always be biased. In this case, the intercept
estimate is biased downwards, whereas the slope estimate is biased upwards. This bias does not
disappear with more data, as any new observation will on the average be either above the true
line if $P_t > E(P_t)$ or below the line if $P_t < E(P_t)$. Hence, these OLS estimates are inconsistent.

Deaton (1997, p. 95) has a nice discussion of endogeneity problems in development economics.
One important example pertains to farm size and farm productivity. Empirical studies using
OLS have found an inverse relationship between productivity as measured by log(Output/Acre)
and farm size as measured by (Acreage). This seems counter-intuitive as it suggests that smaller
farms are more productive than larger farms. Economic explanations of this phenomenon include
the observation that hired labor (which is typically used on large farms) is of lower quality
than family labor (which is typically used on small farms). The latter needs less monitoring
and can be entrusted with valuable animals and machinery. Another explanation is that this
phenomenon is an optimal response by small farmers to uncertainty. It could also be a sign of
inefficiency as farmers work too much on their own farms, pushing their marginal productivity
below the market wage. How could this be an endogeneity problem? After all, the amount of
land is outside the control of the farmer. This is true, but that does not mean that acreage
is uncorrelated with the disturbance term. After all, size is unlikely to be independent of the
quality of land. “Desert farms that are used for low-intensity animal grazing are typically larger
than garden farms, where the land is rich and output/acre is high.” In this case, land quality is
negatively correlated with land size. It takes more acres to sustain a cow in West Texas than in
less arid areas. This negative correlation between acres, the explanatory variable, and quality
of land, which is an omitted variable included in the error term, introduces endogeneity. This
in turn results in downward bias of the OLS estimate of the effect of acreage on productivity.
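
The direction of this bias can be seen in a stylized simulation. The Python sketch below is purely illustrative: the data-generating process, the parameter values (a true acreage effect of zero and a land-quality slope of -0.5 on log acreage) and the function names are assumptions made here, not estimates from Deaton or from any data set. Land quality enters productivity with a unit coefficient but is left out of the regression, so it sits in the error term.

```python
import random

def simulate_farms(n, true_effect, quality_slope, seed=0):
    """Generate log acreage and log output per acre with unobserved land quality."""
    rng = random.Random(seed)
    log_acres, log_yield = [], []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)                                # log acreage (standardized)
        quality = quality_slope * a + rng.gauss(0.0, 1.0)      # quality falls with farm size
        y = true_effect * a + quality + rng.gauss(0.0, 1.0)    # quality is omitted from the regression
        log_acres.append(a)
        log_yield.append(y)
    return log_yield, log_acres

def ols_slope(dep, reg):
    """OLS slope of dep on reg in deviation form."""
    n = len(reg)
    reg_bar = sum(reg) / n
    dep_bar = sum(dep) / n
    sxy = sum((r - reg_bar) * (d - dep_bar) for r, d in zip(reg, dep))
    sxx = sum((r - reg_bar) ** 2 for r in reg)
    return sxy / sxx

true_effect, quality_slope = 0.0, -0.5
log_yield, log_acres = simulate_farms(100_000, true_effect, quality_slope)
b_ols = ols_slope(log_yield, log_acres)
# With unit variance of log acreage, the limit is true_effect + quality_slope.
print(f"true effect = {true_effect}, OLS estimate = {b_ols:.3f}")
```

Even though acreage has no effect on productivity in this setup, OLS reports a negative one, because acreage proxies for the omitted land quality.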

Endogeneity can also be caused by sample selection. Gronau (1973) observed that women with
small children had higher wages than women with no children. An economic explanation is that
women with children have higher reservation wages and as a result fewer of them work. Of those
that work, their observed wages are higher than those without children. The endogeneity works
through the unobserved component in the working woman's wage that induces her to work. This
is positively correlated with the number of children she has and therefore introduces upward
bias in the OLS estimate of the effect of the number of children on wages.


11.1.2  The  Identification  Problem 

In general, we can think of any structural equation, say the first, as having one left hand
side endogenous variable $y_1$, $g_1$ right hand side endogenous variables, and $k_1$ right hand side
exogenous variables. The right hand side endogenous variables are correlated with the error
term rendering OLS on this equation biased and inconsistent. Normally, for each endogenous
variable, there exists a corresponding structural equation explaining its behavior in the model.
We say that a system of simultaneous equations is complete if there are as many endogenous
variables as there are equations. To correct for the simultaneous bias we need to replace the
right hand side endogenous variables in this equation by variables which are highly correlated
with the ones they are replacing but not correlated with the error term. Using the method of
instrumental variable estimation, discussed below, we will see that these variables turn out to
be the predictors obtained by regressing each right hand side endogenous variable on a subset of
all the exogenous variables in the system. Let us assume that there are $K$ exogenous variables
in the simultaneous system. What set of exogenous variables should we use that would lead
to consistent estimates of this structural equation? A search for the minimum set needed for
consistency leads us to the order condition for identification.

The Order Condition for Identification: A necessary condition for identification of any
structural equation is that the number of exogenous variables excluded from this equation is
greater than or equal to the number of right hand side included endogenous variables. Let $K$
be the number of exogenous variables in the system, then this condition requires $k_2 \geq g_1$, where
$k_2 = K - k_1$.

Let  us  consider  the  demand  and  supply  equations  given  in  (11.13)  and  (11.14)  but  assume 
that the supply equation has in it an extra variable $W_t$ denoting weather conditions. In this
case the demand equation has one right hand side endogenous variable $P_t$, i.e., $g_1 = 1$, and
one excluded exogenous variable $W_t$, making $k_2 = 1$. Since $k_2 \geq g_1$, this order condition is
satisfied; in other words, based on the order condition alone we cannot conclude that the
demand equation is unidentified. The supply equation, however, has $g_1 = 1$ and $k_2 = 0$, making
this equation unidentified, since it does not satisfy the order condition for identification. Note
that  this  condition  is  only  necessary  but  not  sufficient  for  identification.  In  other  words,  it 
is  useful  only  if  it  is  not  satisfied,  in  which  case  the  equation  in  question  is  not  identified. 
Note  that  any  linear  combination  of  the  new  supply  and  demand  equations  would  have  a 
constant,  price  and  weather.  This  looks  like  the  supply  equation  but  not  like  demand.  This 
is why the supply equation is not identified. In order to establish whether the
demand equation is identified, we need the rank condition for identification, which is
discussed in detail in the Appendix to this chapter. Adding a third variable to the supply
equation, like the amount of fertilizer used $F_t$, will not help the supply equation, since a
linear combination of supply and demand will still look like supply. However, it does help the
identification of the demand equation. Denote by $\ell = k_2 - g_1$ the degree of over-identification.
In (11.13) and (11.14) both equations are unidentified (or under-identified) with $\ell = -1$. When
$W_t$ is added to the supply equation, $\ell = 0$ for the demand equation, and it is just-identified.
When both $W_t$ and $F_t$ are included in the supply equation, $\ell = 1$ and the demand equation is
over-identified.
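The counting rule above can be written as a short helper. This is only a sketch: the function name `order_condition` and its arguments are illustrative, the constant is counted as an exogenous variable, and the result says nothing about the rank condition.

```python
def order_condition(K, k1, g1):
    """Classify an equation using the order condition only.

    K  : number of exogenous variables in the whole system
    k1 : number of exogenous variables included in this equation
    g1 : number of right hand side endogenous variables in this equation
    """
    k2 = K - k1      # excluded exogenous variables
    ell = k2 - g1    # degree of over-identification
    if ell < 0:
        status = "under-identified"
    elif ell == 0:
        status = "just-identified"
    else:
        status = "over-identified"
    return ell, status


# Demand equation when supply contains weather W_t: exogenous = {constant, W_t}
print(order_condition(K=2, k1=1, g1=1))
# Supply equation in the same system: nothing is excluded
print(order_condition(K=2, k1=2, g1=1))
# Demand equation when supply contains both W_t and F_t
print(order_condition(K=3, k1=1, g1=1))
```

Because the order condition is only necessary, a non-negative degree of over-identification returned here means "not ruled out", not "identified".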

Without the use of matrices, we can describe a two-stage least squares method that will
estimate the demand equation consistently. First, we run the right hand side endogenous variable
$P_t$ on a constant and $W_t$ and get $\hat{P}_t$, then replace $P_t$ in the demand equation with $\hat{P}_t$ and perform
this second stage regression. In other words, the first step regression is

$$P_t = \pi_{11} + \pi_{12} W_t + v_t \qquad (11.22)$$

with $\hat{v}_t = P_t - \hat{P}_t$ satisfying the OLS normal equations $\sum_{t=1}^{T} \hat{v}_t = \sum_{t=1}^{T} \hat{v}_t W_t = 0$. The second
stage regression is

$$Q_t = \alpha + \beta \hat{P}_t + \epsilon_t \qquad (11.23)$$

with $\sum_{t=1}^{T} \hat{\epsilon}_t = \sum_{t=1}^{T} \hat{\epsilon}_t \hat{P}_t = 0$. Using (11.13) and (11.23), we can write

$$\epsilon_t = \beta(P_t - \hat{P}_t) + u_{1t} = \beta \hat{v}_t + u_{1t} \qquad (11.24)$$

so that $\sum_{t=1}^{T} \epsilon_t = \sum_{t=1}^{T} u_{1t}$ and $\sum_{t=1}^{T} \epsilon_t \hat{P}_t = \sum_{t=1}^{T} u_{1t} \hat{P}_t$, using the fact that
$\sum_{t=1}^{T} \hat{v}_t = \sum_{t=1}^{T} \hat{v}_t \hat{P}_t = 0$. So the new error $\epsilon_t$ behaves as the original disturbance $u_{1t}$. However, our
right hand side variable is now $\hat{P}_t$, which is independent of $u_{1t}$ since it is a linear combination
of exogenous variables only. We essentially decomposed $P_t$ into two parts. The first part $\hat{P}_t$ is a
linear combination of exogenous variables and therefore independent of the $u_{1t}$'s. The second
part is $\hat{v}_t$, which is correlated with $u_{1t}$. In fact, this is the source of simultaneous bias. The
two parts $\hat{P}_t$ and $\hat{v}_t$ are orthogonal to each other by construction. Hence when the $\hat{v}_t$'s become
part of the new error $\epsilon_t$, they are orthogonal to the new regressor $\hat{P}_t$. Furthermore, $\hat{P}_t$ is also
independent of $u_{1t}$.
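The two steps just described can be traced on simulated data. The sketch below is an illustration under assumed values: the parameter values, the sample size, the seed and the helper names `ols_simple` and `simulate` are all hypothetical and not taken from the text. It checks the normal equations of the first stage and compares the second-stage slope with OLS.

```python
import random


def ols_simple(y, x):
    """Intercept and slope from regressing y on a constant and x."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def simulate(T=5000, seed=0):
    """Assumed system: demand Q = alpha + beta*P + u1,
    supply Q = gamma + delta*P + phi*W + u2 (weather W only in supply)."""
    rng = random.Random(seed)
    alpha, beta = 10.0, -1.0
    gamma, delta, phi = 2.0, 1.0, 2.0
    P, Q, W = [], [], []
    for _ in range(T):
        w, u1, u2 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        # Reduced form for price, obtained by equating demand and supply
        p = (gamma - alpha + phi * w + u2 - u1) / (beta - delta)
        P.append(p)
        Q.append(alpha + beta * p + u1)
        W.append(w)
    return P, Q, W, beta


P, Q, W, beta_true = simulate()

# First stage: P on a constant and W
pi11, pi12 = ols_simple(P, W)
P_hat = [pi11 + pi12 * w for w in W]
v_hat = [p - ph for p, ph in zip(P, P_hat)]

# Second stage: Q on a constant and P_hat; OLS of Q on P for comparison
_, beta_2sls = ols_simple(Q, P_hat)
_, beta_ols = ols_simple(Q, P)

print("sum v_hat         :", sum(v_hat))
print("sum v_hat * W     :", sum(v * w for v, w in zip(v_hat, W)))
print("sum v_hat * P_hat :", sum(v * ph for v, ph in zip(v_hat, P_hat)))
print("true beta, 2SLS, OLS:", beta_true, beta_2sls, beta_ols)
```

In this design the price is positively correlated with the demand disturbance, so OLS is pulled away from the true slope while the second-stage estimate is not.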

Why would this procedure not work on the estimation of (11.13) if the model is given by
equations (11.13) and (11.14)? The answer is that in (11.22) we will only have a constant, and
no $W_t$. When we try to run the second-stage regression in (11.23) the regression will fail because
of perfect multicollinearity between the constant and $\hat{P}_t$. This will happen whenever the order
condition is not satisfied and the equation is not identified, see Kelejian and Oates (1989).
Hence, in order for the second stage to succeed we need at least one exogenous variable that is
excluded from the demand equation but included in the supply equation, i.e., variables like $W_t$ or
$F_t$. Therefore, whenever the second-stage regression fails because of perfect multicollinearity
between the right hand side regressors, this implies that the order condition of identification is
not satisfied.

In general, if we are given an equation like

$$y_1 = \alpha_{12} y_2 + \beta_{11} X_1 + \beta_{12} X_2 + u_1 \qquad (11.25)$$

the order condition requires the existence of at least one exogenous variable excluded from
(11.25), say $X_3$. These extra exogenous variables like $X_3$ usually appear in other equations of
our simultaneous equation model. In the first step regression we run

$$y_2 = \pi_{21} X_1 + \pi_{22} X_2 + \pi_{23} X_3 + v_2 \qquad (11.26)$$

with the OLS residuals $\hat{v}_2$ satisfying

$$\sum_{t=1}^{T} \hat{v}_{2t} X_{1t} = 0; \quad \sum_{t=1}^{T} \hat{v}_{2t} X_{2t} = 0; \quad \sum_{t=1}^{T} \hat{v}_{2t} X_{3t} = 0 \qquad (11.27)$$

and in the second step, we run the regression

$$y_1 = \alpha_{12} \hat{y}_2 + \beta_{11} X_1 + \beta_{12} X_2 + \epsilon_1 \qquad (11.28)$$

where $\epsilon_1 = \alpha_{12}(y_2 - \hat{y}_2) + u_1 = \alpha_{12} \hat{v}_2 + u_1$. This regression will lead to consistent estimates,
because

$$\sum_{t=1}^{T} \hat{y}_{2t} \epsilon_{1t} = \sum_{t=1}^{T} \hat{y}_{2t} u_{1t}; \quad \sum_{t=1}^{T} X_{1t} \epsilon_{1t} = \sum_{t=1}^{T} X_{1t} u_{1t}; \quad \sum_{t=1}^{T} X_{2t} \epsilon_{1t} = \sum_{t=1}^{T} X_{2t} u_{1t} \qquad (11.29)$$

and $u_{1t}$ is independent of the exogenous variables. In order to solve for the three structural parameters
$\alpha_{12}$, $\beta_{11}$ and $\beta_{12}$ one needs three linearly independent OLS normal equations. $\sum_{t=1}^{T} \hat{y}_{2t} \hat{\epsilon}_{1t} = 0$
is a new piece of information provided $y_2$ is regressed on at least one extra variable besides
$X_1$ and $X_2$. Otherwise, $\sum_{t=1}^{T} X_{1t} \hat{\epsilon}_{1t} = \sum_{t=1}^{T} X_{2t} \hat{\epsilon}_{1t} = 0$ are the only two linearly independent
normal equations in three structural parameters.
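The loss of a normal equation can be seen as a rank deficiency in the second-stage regressor matrix. The following sketch uses simulated data with made-up coefficients (all names and values are assumptions) and compares a first stage that uses the excluded variable $X_3$ with one that does not.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
X1 = np.ones(T)                      # included exogenous variable (constant)
X2 = rng.normal(size=T)              # included exogenous variable
X3 = rng.normal(size=T)              # excluded exogenous variable
y2 = 1.0 + 0.5 * X2 + 1.5 * X3 + rng.normal(size=T)


def fitted(y, X):
    """OLS fitted values from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef


y2_hat_ok = fitted(y2, np.column_stack([X1, X2, X3]))   # first stage uses X3
y2_hat_bad = fitted(y2, np.column_stack([X1, X2]))      # no excluded variable

# Second-stage regressors [y2_hat, X1, X2]
rank_ok = np.linalg.matrix_rank(np.column_stack([y2_hat_ok, X1, X2]))
rank_bad = np.linalg.matrix_rank(np.column_stack([y2_hat_bad, X1, X2]))
print(rank_ok, rank_bad)
```

Without $X_3$ the predictor is an exact linear combination of $X_1$ and $X_2$, so only two of the three columns are linearly independent and the second stage cannot be run.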

What happens if there is another right hand side endogenous variable, say $y_3$? In that case
(11.25) becomes

$$y_1 = \alpha_{12} y_2 + \alpha_{13} y_3 + \beta_{11} X_1 + \beta_{12} X_2 + u_1 \qquad (11.30)$$

Now we need at least two exogenous variables that are excluded from (11.30) for the order
condition to be satisfied, and the second stage regression to run. Otherwise, we will have fewer
linearly independent equations than there are structural parameters to estimate, and the second
stage regression will fail. Also, $y_2$ and $y_3$ should be regressed on the same set of exogenous
variables. Furthermore, this set of regressors should always include the right hand side
exogenous variables of (11.30). These two conditions will ensure consistency of the estimates.
Let $X_3$ and $X_4$ be the excluded exogenous variables from (11.30). Our first step regression
would regress $y_2$ and $y_3$ on $X_1$, $X_2$, $X_3$ and $X_4$ to get $\hat{y}_2$ and $\hat{y}_3$, respectively. The second stage
regression would regress $y_1$ on $\hat{y}_2$, $\hat{y}_3$, $X_1$ and $X_2$. From the first step regressions we have

$$y_2 = \hat{y}_2 + \hat{v}_2 \quad \text{and} \quad y_3 = \hat{y}_3 + \hat{v}_3 \qquad (11.31)$$

where $\hat{y}_2$ and $\hat{y}_3$ are linear combinations of the $X$'s, and $\hat{v}_2$ and $\hat{v}_3$ are the residuals. The second
stage regression has the following normal equations

$$\sum_{t=1}^{T} \hat{y}_{2t} \hat{\epsilon}_{1t} = \sum_{t=1}^{T} \hat{y}_{3t} \hat{\epsilon}_{1t} = \sum_{t=1}^{T} X_{1t} \hat{\epsilon}_{1t} = \sum_{t=1}^{T} X_{2t} \hat{\epsilon}_{1t} = 0 \qquad (11.32)$$

where $\hat{\epsilon}_1$ denotes the residuals from the second stage regression. In fact

$$\epsilon_1 = \alpha_{12} \hat{v}_2 + \alpha_{13} \hat{v}_3 + u_1 \qquad (11.33)$$

Now $\sum_{t=1}^{T} \epsilon_{1t} \hat{y}_{2t} = \sum_{t=1}^{T} u_{1t} \hat{y}_{2t}$ because $\sum_{t=1}^{T} \hat{v}_{2t} \hat{y}_{2t} = \sum_{t=1}^{T} \hat{v}_{3t} \hat{y}_{2t} = 0$. The latter holds because
$\hat{y}_2$, the predictor, is orthogonal to $\hat{v}_2$, the residual. Also, $\hat{y}_2$ is orthogonal to $\hat{v}_3$ if $y_2$ is regressed on
a set of $X$'s that are a subset of the regressors included in the first step regression of $y_3$. Similarly,
$\sum_{t=1}^{T} \epsilon_{1t} \hat{y}_{3t} = \sum_{t=1}^{T} u_{1t} \hat{y}_{3t}$ if $y_3$ is regressed on a set of exogenous variables that are a subset of
the $X$'s included in the first step regression of $y_2$. Combining these two conditions leads to the
following fact: $y_2$ and $y_3$ have to be regressed on the same set of exogenous variables for the
composite error term to behave like the original error. Furthermore, these exogenous variables
should include the included $X$'s on the right hand side of the equation to be estimated, i.e., $X_1$
and $X_2$; otherwise, $\sum_{t=1}^{T} \epsilon_{1t} X_{1t}$ is not necessarily equal to $\sum_{t=1}^{T} u_{1t} X_{1t}$, because $\sum_{t=1}^{T} \hat{v}_{2t} X_{1t}$
or $\sum_{t=1}^{T} \hat{v}_{3t} X_{1t}$ are not necessarily zero. For further analysis along these lines, see problem 2.
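The role of a common first-stage regressor set can be checked numerically. In this sketch (simulated data, assumed coefficients, hypothetical helper `fit_and_resid`), $y_2$ is always regressed on all four $X$'s, while $y_3$ is regressed either on the same set or on a smaller set that drops $X_4$.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 500
# Columns: X1 (constant), X2, X3, X4
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
y2 = X @ np.array([1.0, 1.0, 1.0, 1.0]) + rng.normal(size=T)
y3 = X @ np.array([1.0, -1.0, 1.0, 1.0]) + rng.normal(size=T)


def fit_and_resid(y, regressors):
    """OLS fitted values and residuals of y on the given regressors."""
    coef, *_ = np.linalg.lstsq(regressors, y, rcond=None)
    fit = regressors @ coef
    return fit, y - fit


y2_hat, v2_hat = fit_and_resid(y2, X)
_, v3_same = fit_and_resid(y3, X)          # same regressor set as y2
_, v3_sub = fit_and_resid(y3, X[:, :3])    # drops X4

cross_same = y2_hat @ v3_same   # y2_hat lies in the span used for y3: zero
cross_sub = y2_hat @ v3_sub     # y2_hat uses X4, residual is not orthogonal to it
print(cross_same, cross_sub)
```

When the sets differ, the cross product between one predictor and the other residual does not vanish, so the composite error no longer behaves like $u_1$ in the normal equations.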


11.2  Single  Equation  Estimation:  Two-Stage  Least  Squares 

In matrix form, we can write the first structural equation as

$$y_1 = Y_1 \alpha_1 + X_1 \beta_1 + u_1 = Z_1 \delta_1 + u_1 \qquad (11.34)$$

where $y_1$ and $u_1$ are $(T \times 1)$, $Y_1$ denotes the right hand side endogenous variables, which is
$(T \times g_1)$, and $X_1$ is the set of right hand side included exogenous variables, which is $(T \times k_1)$; $\alpha_1$
is of dimension $g_1$ and $\beta_1$ is of dimension $k_1$. $Z_1 = [Y_1, X_1]$ and $\delta_1' = (\alpha_1', \beta_1')$. We require the
existence of excluded exogenous variables from (11.34), call them $X_2$, enough to identify this
equation. These excluded exogenous variables appear in the other equations in the simultaneous
model. Let the set of all exogenous variables be $X = [X_1, X_2]$ where $X$ is of dimension $(T \times k)$.
For the order condition to be satisfied for equation (11.34) we must have $(k - k_1) \geq g_1$. If all the
exogenous variables in the system are included in the first step regression, i.e., $Y_1$ is regressed on
$X$ to get $\hat{Y}_1$, the resulting second stage least squares estimator obtained from regressing $y_1$ on
$\hat{Y}_1$ and $X_1$ is called two-stage least squares (2SLS). This method was proposed independently
by Basmann (1957) and Theil (1953). In matrix form $\hat{Y}_1 = P_X Y_1$ is the predictor of the right
hand side endogenous variables, where $P_X$ is the projection matrix $X(X'X)^{-1}X'$. Replacing $Y_1$
by $\hat{Y}_1$ in (11.34), we get

$$y_1 = \hat{Y}_1 \alpha_1 + X_1 \beta_1 + w_1 = \hat{Z}_1 \delta_1 + w_1 \qquad (11.35)$$

where $\hat{Z}_1 = [\hat{Y}_1, X_1]$ and $w_1 = u_1 + (Y_1 - \hat{Y}_1)\alpha_1$. Running OLS on (11.35) one gets

$$\hat{\delta}_{1,2SLS} = (\hat{Z}_1'\hat{Z}_1)^{-1} \hat{Z}_1' y_1 = (Z_1' P_X Z_1)^{-1} Z_1' P_X y_1 \qquad (11.36)$$

where the second equality follows from the fact that $\hat{Z}_1 = P_X Z_1$ and the fact that $P_X$ is
idempotent. The former equality holds because $P_X X = X$, hence $P_X X_1 = X_1$, and $P_X Y_1 = \hat{Y}_1$.
If there is only one right hand side endogenous variable, running the first-stage regression of $y_2$
on $X_1$ and $X_2$ and testing that the coefficients of $X_2$ are all zero against the hypothesis that
at least one of these coefficients is different from zero is a test for rank identification. In case
of several right hand side endogenous variables, things get complicated, see Cragg and Donald
(1996), but one can still run the first-stage regressions for each right hand side endogenous
variable to make sure that at least one element of $X_2$ is significantly different from zero.$^1$ This
is not sufficient for the rank condition but it is a good diagnostic for whether the rank condition
fails. If we fail to meet this requirement we should question our 2SLS estimator.
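The equality of the two expressions in (11.36) can be verified on simulated data. The sketch below assumes a hypothetical data generating process (coefficients, seed and the function name `two_sls` are illustrative) and avoids forming the $T \times T$ matrix $P_X$ by using fitted values from a least squares solve.

```python
import numpy as np


def two_sls(y1, Z1, X):
    """2SLS as (Z1' P_X Z1)^{-1} Z1' P_X y1, computed via P_X Z1."""
    Z1_hat = X @ np.linalg.lstsq(X, Z1, rcond=None)[0]   # P_X Z1
    # Z1' P_X Z1 = Z1_hat' Z1 and Z1' P_X y1 = Z1_hat' y1 since P_X is symmetric
    return np.linalg.solve(Z1_hat.T @ Z1, Z1_hat.T @ y1)


rng = np.random.default_rng(3)
T = 400
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])  # all exogenous variables
X1 = X[:, :2]                                               # included exogenous variables
u1 = rng.normal(size=T)
# y2 is endogenous: it depends on u1
y2 = X @ np.array([1.0, 0.5, 1.0, -1.0]) + 0.8 * u1 + rng.normal(size=T)
y1 = 0.7 * y2 + X1 @ np.array([2.0, -1.0]) + u1
Z1 = np.column_stack([y2, X1])

delta_formula = two_sls(y1, Z1, X)

# Explicit two steps: predict y2 from X, then OLS of y1 on [y2_hat, X1]
y2_hat = X @ np.linalg.lstsq(X, y2, rcond=None)[0]
Z1_hat = np.column_stack([y2_hat, X1])
delta_two_step = np.linalg.lstsq(Z1_hat, y1, rcond=None)[0]

print(delta_formula)
print(delta_two_step)
```

Both routes give the same coefficient vector, ordered as the coefficient on $y_2$ followed by those on the columns of $X_1$.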

Two-stage least squares can also be thought of as a simple instrumental variables estimator
with the set of instruments $W = \hat{Z}_1 = [\hat{Y}_1, X_1]$. Recall that $Y_1$ is correlated with $u_1$, rendering
OLS inconsistent. The idea of simple instrumental variables is to find a set of instruments,
say $W$ for $Z_1$, with the following properties: (1) plim $W'u_1/T = 0$: the instruments have to
be exogenous, i.e., uncorrelated with the error term, otherwise this defeats the purpose of the
instruments and results in inconsistent estimates. (2) plim $W'W/T = Q_W \neq 0$, where $Q_W$ is
finite and positive definite: the $W$'s should not be perfectly multicollinear. (3) $W$ should be
highly correlated with $Z_1$, i.e., the instruments should be highly relevant, not weak instruments
as we will explain shortly. In fact, plim $W'Z_1/T$ should be finite and of full rank $(k_1 + g_1)$.
Premultiplying (11.34) by $W'$, we get

$$W'y_1 = W'Z_1 \delta_1 + W'u_1 \qquad (11.37)$$

In this case, $W = \hat{Z}_1$ is of the same dimension as $Z_1$, and since plim $W'Z_1/T$ is square and of
full rank $(k_1 + g_1)$, the simple instrumental variable (IV) estimator of $\delta_1$ becomes

$$\hat{\delta}_{1,IV} = (W'Z_1)^{-1} W'y_1 = \delta_1 + (W'Z_1)^{-1} W'u_1 \qquad (11.38)$$

with plim $\hat{\delta}_{1,IV} = \delta_1$, which follows from (11.37) and the fact that plim $W'u_1/T = 0$.

Digression: In the general linear model, $y = X\beta + u$, $X$ is the set of instruments for $X$.
Premultiplying by $X'$ we get $X'y = X'X\beta + X'u$ and using the fact that plim $X'u/T = 0$, one
gets

$$\hat{\beta}_{IV} = (X'X)^{-1} X'y = \hat{\beta}_{OLS}.$$

This estimator is consistent as long as $X$ and $u$ are uncorrelated. In the simultaneous equation
model for the first structural equation given in (11.34), the right hand side regressors $Z_1$ include
endogenous variables $Y_1$ that are correlated with $u_1$. Therefore OLS on (11.34) will lead to
inconsistent estimates, since the matrix of instruments is $W = Z_1$, and $Z_1$ is correlated with $u_1$.
In fact,

$$\hat{\delta}_{1,OLS} = (Z_1'Z_1)^{-1} Z_1' y_1 = \delta_1 + (Z_1'Z_1)^{-1} Z_1' u_1$$

with plim $\hat{\delta}_{1,OLS} \neq \delta_1$ since plim $Z_1'u_1/T \neq 0$.

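Denote by $e_{1,OLS} = y_1 - Z_1 \hat{\delta}_{1,OLS}$ the OLS residuals on the first structural equation, then

$$\text{plim } s_1^2 = \text{plim } \frac{e_{1,OLS}' e_{1,OLS}}{T - (g_1 + k_1)} = \sigma_{11} - \text{plim } \frac{u_1' P_{Z_1} u_1}{T} \leq \sigma_{11}$$

with $P_{Z_1} = Z_1(Z_1'Z_1)^{-1}Z_1'$, since the last term is positive. Only if plim $Z_1'u_1/T$ is zero will plim $s_1^2 = \sigma_{11}$, otherwise it is
smaller. OLS fits very well, it minimizes $(y_1 - Z_1\delta_1)'(y_1 - Z_1\delta_1)$. Since $Z_1$ and $u_1$ are correlated,
OLS attributes part of the variation in $y_1$ that is due to $u_1$ incorrectly to the regressor $Z_1$.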

Both the simple IV and OLS estimators can be interpreted as method of moments estimators.
These were discussed in Chapter 2. For OLS, the population moment conditions are given
by $E(X'u) = 0$ and the corresponding sample moment conditions yield $X'(y - X\hat{\beta})/T = 0$.
Solving for $\hat{\beta}$ results in $\hat{\beta}_{OLS}$. Similarly, the population moment conditions for the simple IV
estimator in (11.37) are $E(W'u_1) = 0$ and the corresponding sample moment conditions yield
$W'(y_1 - Z_1\hat{\delta}_1)/T = 0$. Solving for $\hat{\delta}_1$ results in $\hat{\delta}_{1,IV}$ given in (11.38).

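If $W = [\hat{Y}_1, X_1]$, then (11.38) results in

$$\hat{\delta}_{1,IV} = \begin{bmatrix} \hat{Y}_1'Y_1 & \hat{Y}_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{bmatrix}^{-1} \begin{bmatrix} \hat{Y}_1'y_1 \\ X_1'y_1 \end{bmatrix} \qquad (11.39)$$

which is the same as (11.36)

$$\hat{\delta}_{1,2SLS} = \begin{bmatrix} \hat{Y}_1'\hat{Y}_1 & \hat{Y}_1'X_1 \\ X_1'\hat{Y}_1 & X_1'X_1 \end{bmatrix}^{-1} \begin{bmatrix} \hat{Y}_1'y_1 \\ X_1'y_1 \end{bmatrix} \qquad (11.40)$$

provided $\hat{Y}_1'\hat{Y}_1 = \hat{Y}_1'Y_1$ and $X_1'\hat{Y}_1 = X_1'Y_1$. The latter conditions hold because $\hat{Y}_1 = P_X Y_1$ and
$P_X X_1 = X_1$.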

In general, let $X^*$ be our set of first stage regressors. An IV estimator with $\hat{Y}_1^* = P_{X^*} Y_1$, i.e.,
with every right hand side $y$ regressed on the same set of regressors $X^*$, will satisfy

$$\hat{Y}_1^{*\prime} \hat{Y}_1^* = \hat{Y}_1^{*\prime} P_{X^*} Y_1 = \hat{Y}_1^{*\prime} Y_1$$

In addition, for $X_1'\hat{Y}_1^*$ to equal $X_1'Y_1$, $X_1$ has to be a subset of the regressors in $X^*$. Therefore
$X^*$ should include $X_1$ and at least as many $X$'s from $X_2$ as is required for identification, i.e.,
at least $g_1$ of the $X$'s from $X_2$. In this case, the IV estimator using $W^* = [\hat{Y}_1^*, X_1]$ will result
in the same estimator as that obtained by a two stage regression where in the first step $\hat{Y}_1^*$
is obtained by regressing $Y_1$ on $X^*$, and in the second step $y_1$ is regressed on $W^*$. Note that
these are the same conditions required for consistency of an IV estimator. Note also that if this
equation is just-identified, then there are exactly $g_1$ of the $X$'s excluded from that equation. In
other words, $X_2$ is of dimension $(T \times g_1)$, and $X^* = X$ is of dimension $T \times (g_1 + k_1)$. Problem
3 shows that 2SLS in this case reduces to an IV estimator with $W = X$, i.e.,

$$\hat{\delta}_{1,2SLS} = \hat{\delta}_{1,IV} = (X'Z_1)^{-1} X'y_1 \qquad (11.41)$$

Note that if the first equation is over-identified, then $X'Z_1$ is not square and (11.41) cannot be
computed.
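A small numerical check of (11.41) for a just-identified equation follows. The data generating process, seed and variable names are assumptions made for illustration: there is one right hand side endogenous variable and exactly one excluded exogenous variable, so $X$ and $Z_1$ have the same number of columns.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 300
# X = [X1, X2] with X1 = [constant, x2] included and X2 = [x3] excluded
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
X1 = X[:, :2]
u1 = rng.normal(size=T)
y2 = X @ np.array([1.0, 0.5, 1.5]) + 0.8 * u1 + rng.normal(size=T)
y1 = 0.7 * y2 + X1 @ np.array([2.0, -1.0]) + u1
Z1 = np.column_stack([y2, X1])          # T x (g1 + k1), same shape as X

# Simple IV with W = X: (X'Z1)^{-1} X'y1
delta_iv = np.linalg.solve(X.T @ Z1, X.T @ y1)

# 2SLS from (11.36), using Z1_hat = P_X Z1
Z1_hat = X @ np.linalg.lstsq(X, Z1, rcond=None)[0]
delta_2sls = np.linalg.solve(Z1_hat.T @ Z1, Z1_hat.T @ y1)

# The sample moment conditions X'(y1 - Z1 delta) = 0 hold exactly in this case
moments = X.T @ (y1 - Z1 @ delta_iv)
print(delta_iv, delta_2sls, moments)
```

With more excluded exogenous variables than right hand side endogenous variables, `X.T @ Z1` would no longer be square and only the 2SLS formula would apply.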

Rather than having $W$, the matrix of instruments, be of exactly the same dimension as $Z_1$,
which is required for the expression in (11.38), one can define a generalized instrumental variable
estimator in terms of a general matrix $W$ of dimension $T \times \ell$ where $\ell \geq g_1 + k_1$. The latter condition is
the order condition for identification. In this case, $\hat{\delta}_{1,IV}$ is obtained as GLS on (11.37). Using
the fact that

$$\text{plim } W'u_1u_1'W/T = \sigma_{11} \, \text{plim } W'W/T,$$

one gets

$$\hat{\delta}_{1,IV} = (Z_1'P_W Z_1)^{-1} Z_1'P_W y_1 = \delta_1 + (Z_1'P_W Z_1)^{-1} Z_1'P_W u_1$$

with plim $\hat{\delta}_{1,IV} = \delta_1$ and limiting covariance matrix $\sigma_{11}$ plim $(Z_1'P_W Z_1/T)^{-1}$. Therefore, 2SLS
can be obtained as a generalized instrumental variable estimator with $W = X$. This also means
that 2SLS of $\delta_1$ can be obtained as GLS on (11.34) after premultiplication by $X'$, see problem
4. Note that GLS on (11.37) minimizes $(y_1 - Z_1\delta_1)' P_W (y_1 - Z_1\delta_1)$, which yields the first-order
conditions

$$Z_1'P_W(y_1 - Z_1\hat{\delta}_{1,IV}) = 0$$

the solution of which is $\hat{\delta}_{1,IV} = (Z_1'P_W Z_1)^{-1} Z_1'P_W y_1$. It can also be shown that 2SLS and
the generalized instrumental variables estimators are special cases of a Generalized Method of
Moments (GMM) estimator considered by Hansen (1982). See Davidson and MacKinnon (1993)
and Hall (1993) for an introduction to GMM.

For the matrix $Z_1'P_W Z_1$ to be of full rank and invertible, a necessary condition is that $W$ must
be of full rank $\ell \geq (g_1 + k_1)$. This is, in fact, the order condition of identification. If $\ell = g_1 + k_1$,
then this equation is just-identified. Also, $W'Z_1$ is square and nonsingular. Problem 10 asks
the reader to verify that the generalized instrumental variable estimator reduces to the simple
instrumental variable estimator given in (11.38). Also, under just-identification the minimized
value of the criterion function is zero.
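The generalized instrumental variable estimator and its criterion function can be sketched as follows. The function name `giv`, the simulated design and the seed are assumptions; the example compares a just-identified choice of $W$ with an over-identified one.

```python
import numpy as np


def giv(y1, Z1, W):
    """Generalized IV estimate and minimized criterion (y1 - Z1 d)' P_W (y1 - Z1 d)."""
    Z1_fit = W @ np.linalg.lstsq(W, Z1, rcond=None)[0]          # P_W Z1
    delta = np.linalg.solve(Z1_fit.T @ Z1, Z1_fit.T @ y1)
    resid = y1 - Z1 @ delta
    resid_fit = W @ np.linalg.lstsq(W, resid, rcond=None)[0]    # P_W resid
    return delta, float(resid @ resid_fit)


rng = np.random.default_rng(6)
T = 300
X = np.column_stack([np.ones(T), rng.normal(size=(T, 4))])
X1 = X[:, :2]
u1 = rng.normal(size=T)
y2 = X @ np.array([1.0, 0.5, 1.0, 1.0, 1.0]) + 0.8 * u1 + rng.normal(size=T)
y1 = 0.7 * y2 + X1 @ np.array([2.0, -1.0]) + u1
Z1 = np.column_stack([y2, X1])

W_just = X[:, :3]     # l = g1 + k1 = 3 instruments
W_over = X            # l = 5 instruments, two over-identifying restrictions

delta_just, crit_just = giv(y1, Z1, W_just)
delta_over, crit_over = giv(y1, Z1, W_over)
delta_simple = np.linalg.solve(W_just.T @ Z1, W_just.T @ y1)   # (W'Z1)^{-1} W'y1

print(delta_just, crit_just)
print(delta_over, crit_over)
```

In the just-identified case the estimate coincides with the simple IV formula and the criterion is zero up to rounding error, while with extra instruments the sample moments cannot all be set to zero and the criterion is strictly positive.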

One of the biggest problems with IV estimation is the choice of the instrumental variables
$W$. We have listed some necessary conditions for this set of instruments to yield consistent
estimators of the structural coefficients. However, different choices by different researchers may
yield different estimates in finite samples. Using more instruments will yield more efficient IV
estimation. Let $W_1$ and $W_2$ be two sets of IV's with $W_1$ being spanned by the space of $W_2$. In
this case, $P_{W_2} W_1 = W_1$ and therefore, $P_{W_2} P_{W_1} = P_{W_1}$. The corresponding IV estimators

$$\hat{\delta}_{1,W_i} = (Z_1'P_{W_i} Z_1)^{-1} Z_1'P_{W_i} y_1 \quad \text{for } i = 1, 2$$

are both consistent for $\delta_1$ as long as plim $W_i'u_1/T = 0$ and have asymptotic covariance matrices

$$\sigma_{11} \, \text{plim } (Z_1'P_{W_i} Z_1/T)^{-1}$$

Note that $\hat{\delta}_{1,W_2}$ is at least as efficient as $\hat{\delta}_{1,W_1}$ if the difference in their asymptotic covariance
matrices is positive semi-definite, i.e., if

$$\sigma_{11} \left[ \text{plim } \frac{Z_1'P_{W_1} Z_1}{T} \right]^{-1} - \sigma_{11} \left[ \text{plim } \frac{Z_1'P_{W_2} Z_1}{T} \right]^{-1}$$

is p.s.d. This holds if $Z_1'P_{W_2} Z_1 - Z_1'P_{W_1} Z_1$ is p.s.d. This last condition holds since $P_{W_2} - P_{W_1}$ is
idempotent. Problem 11 asks the reader to verify this result. $\hat{\delta}_{1,W_2}$ is more efficient than $\hat{\delta}_{1,W_1}$
since $W_2$ explains $Z_1$ at least as well as $W_1$. This seems to suggest that one should use as many
instruments as possible. If $T$ is large this is a good strategy. But, if $T$ is finite, there will be a
trade-off between this gain in asymptotic efficiency and the introduction of more finite sample
bias in our IV estimator.
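The projection argument behind this efficiency ranking can be checked directly. The sketch below builds two nested instrument sets from random data (all names, dimensions and the seed are assumptions) and inspects $P_{W_2} - P_{W_1}$ and $Z_1'(P_{W_2} - P_{W_1})Z_1$.

```python
import numpy as np


def proj(W):
    """Projection matrix P_W = W (W'W)^{-1} W' (only sensible for small T)."""
    return W @ np.linalg.solve(W.T @ W, W.T)


rng = np.random.default_rng(5)
T = 120
W2 = np.column_stack([np.ones(T), rng.normal(size=(T, 4))])
W1 = W2[:, :3]                               # W1 lies in the column space of W2
Z1 = rng.normal(size=(T, 2)) + W2[:, 1:3]    # regressors related to the instruments

P1, P2 = proj(W1), proj(W2)
D = P2 - P1
diff = Z1.T @ D @ Z1
eigs = np.linalg.eigvalsh((diff + diff.T) / 2)   # symmetrize against rounding error

print(np.allclose(P2 @ P1, P1), np.allclose(D @ D, D), eigs)
```

Since the difference of the two projections is itself a projection, the quadratic form has no negative eigenvalues, which is the p.s.d. condition used in the text.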

In fact, the more instruments we use, the more will $\hat{Y}_1$ resemble $Y_1$ and the more bias is
introduced in this second stage regression. The extreme case where $Y_1$ is perfectly predicted
by $\hat{Y}_1$ returns us to OLS, which we know is biased. On the other hand, if our set of instruments
has little ability in predicting $Y_1$, then the resulting instrumental variable estimator
will be inefficient and its asymptotic distribution will not resemble its finite sample distribution,
see Nelson and Startz (1990). If the number of instruments is fixed and the coefficients
of the instruments in the first stage regression go to zero at the rate $1/\sqrt{T}$, indicating weak
correlation, Staiger and Stock (1997) find that even as $T$ increases, IV estimation is not consistent
and has a nonstandard asymptotic distribution. Bound et al. (1995) recommend reporting
the $R^2$ or the F-statistic of the first stage regression as a useful indicator of the quality of IV
estimates.

Instrumental  variables  are  important  for  obtaining  consistent  estimates  when  endogeneity  is 
suspected.  However,  invalid  instruments  can  produce  meaningless  results.  How  do  we  know 
whether  our  instruments  are  valid?  Stock  and  Watson  (2003)  draw  an  analogy  between  a 
relevant  instrument  and  a  large  sample.  The  more  relevant  the  instrument,  i.e. ,  the  more  the 
variation  in  the  right  hand  side  endogenous  variable  that  is  explained  by  this  instrument,  the 
more  accurate  the  resulting  estimator.  This  is  similar  to  the  observation  that  the  larger  the 
sample  size,  the  more  accurate  the  estimator.  They  argue  that  the  instruments  should  not  just 
be  relevant,  but  highly  relevant  if  the  normal  distribution  is  to  provide  a  good  approximation  to 
the  sampling  distribution  of  2SLS.  Weak  instruments  explain  little  of  the  variation  in  the  right 
hand  side  endogenous  variable  they  are  instrumenting.  This  renders  the  normal  distribution  as  a 
poor  approximation  to  the  sampling  distribution  of  2SLS,  even  if  the  sample  size  is  large.  Stock 
and  Watson  (2003)  suggest  a  simple  rule  of  thumb  to  check  for  weak  instruments.  If  there  is  one 
right  hand  side  endogenous  variable,  the  first-stage  regression  can  test  for  the  significance  of  the 
excluded exogenous variables (or instruments) using an F-statistic. This first-stage F-statistic
should be larger than 10.$^2$ Stock and Watson (2003) suggest that a first-stage F-statistic less
than 10 indicates weak instruments, which casts doubt on the validity of 2SLS, since with weak
instruments, 2SLS will be biased even in large samples and the corresponding t-statistics and
confidence intervals will be unreliable. Finding weak instruments, one can search for additional
stronger instruments, or use estimators other than 2SLS that are less sensitive to weak
instruments, like LIML. Deaton (1997, p. 112) argues that it is difficult to find instruments that
are  exogenous  while  at  the  same  time  highly  correlated  with  the  endogenous  variables  they 
are  instrumenting.  He  argues  that  it  is  easy  to  generate  2SLS  estimates  that  are  different  from 
OLS  but  much  harder  to  make  the  case  that  these  2SLS  estimates  are  necessarily  better  than 
OLS.  “Credible  identification  and  estimation  of  structural  equations  almost  always  requires 
real creativity, and creativity cannot be reduced to a formula.” Stock and Watson (2003, p.
371) show that for the case of a single right hand side endogenous variable with no included
exogenous variables and one weak instrument, the distribution of the 2SLS estimator is
non-normal even for large samples, with the mean of the sampling distribution of the 2SLS estimator
approximately equal to the true coefficient plus the asymptotic bias of the OLS estimator divided
by $(E(F) - 1)$, where $F$ is the first-stage F-statistic. If $E(F) = 10$, then the large sample bias
of 2SLS is $(1/9)$ that of the large sample bias of OLS. They argue that this rule of thumb is
an acceptable cutoff for most empirical applications.
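The arithmetic behind this rule of thumb is easy to reproduce. The helpers below are illustrative only: `first_stage_f` is the usual F-statistic built from restricted and unrestricted first-stage residual sums of squares, and `approx_relative_bias` encodes the $1/(E(F) - 1)$ approximation, which the text states for a single endogenous regressor, no included exogenous variables and one weak instrument. The numerical inputs are made up.

```python
def first_stage_f(rss_restricted, rss_unrestricted, q, T, k):
    """F-statistic for excluding q instruments from a first-stage regression.

    rss_restricted   : RSS without the q excluded exogenous variables
    rss_unrestricted : RSS with all k first-stage regressors
    """
    numerator = (rss_restricted - rss_unrestricted) / q
    denominator = rss_unrestricted / (T - k)
    return numerator / denominator


def approx_relative_bias(expected_f):
    """Approximate large sample 2SLS bias as a fraction of OLS bias."""
    if expected_f <= 1:
        raise ValueError("approximation requires E(F) > 1")
    return 1.0 / (expected_f - 1.0)


F = first_stage_f(rss_restricted=200.0, rss_unrestricted=100.0, q=2, T=103, k=3)
print(F, "weak" if F < 10 else "not flagged as weak")
print(approx_relative_bias(10.0))   # the 1/9 figure quoted in the text
```

The threshold of 10 is a convention rather than a test with a fixed size, so the printed label should be read as a screening device.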

2SLS is a single equation estimator. The focus is on a particular equation. $[y_1, Y_1, X_1]$ is
specified and therefore all that is needed to perform 2SLS is the matrix $X$ of all exogenous variables
in the system. If a researcher is interested in a particular behavioral economic relationship which
may be a part of a big model consisting of several equations, one need not specify the whole
model to perform 2SLS on that equation; all that is needed is the matrix of all exogenous
variables in that system. Empirical studies involving one structural equation specify which right
hand side variables are endogenous and proceed by estimating this equation via an IV procedure
that usually includes all the feasible exogenous variables available to the researcher. If this set
of exogenous variables does not include all the $X$'s in the system, this estimation method is not
2SLS. However, it is a consistent IV method which we will call feasible 2SLS.

Substituting (11.34) in (11.36), we get

$$\hat{\delta}_{1,2SLS} = \delta_1 + (Z_1'P_X Z_1)^{-1} Z_1'P_X u_1 \qquad (11.42)$$

with plim $\hat{\delta}_{1,2SLS} = \delta_1$ and an asymptotic variance covariance matrix given by $\sigma_{11}$ plim
$(Z_1'P_X Z_1/T)^{-1}$. $\sigma_{11}$ is estimated from the 2SLS residuals $\hat{u}_1 = y_1 - Z_1\hat{\delta}_{1,2SLS}$, by computing
$s_{11} = \hat{u}_1'\hat{u}_1/(T - g_1 - k_1)$. It is important to emphasize that $s_{11}$ is obtained from the 2SLS
residuals of the original equation (11.34), not (11.35). In other words, $s_{11}$ is not the mean
squared error (i.e., $s^2$) of the second stage regression given in (11.35). The latter regression has
$\hat{Y}_1$ in it and not $Y_1$. Therefore, the asymptotic variance covariance matrix of 2SLS can be
estimated by $s_{11}(Z_1'P_X Z_1)^{-1} = s_{11}(\hat{Z}_1'\hat{Z}_1)^{-1}$. The t-statistics reported by 2SLS packages are based
on the standard errors obtained from the square root of the diagonal elements of this matrix.
These standard errors and t-statistics can be made robust for heteroskedasticity by computing
$(\hat{Z}_1'\hat{Z}_1)^{-1}(\hat{Z}_1' \text{diag}[\hat{u}_i^2] \hat{Z}_1)(\hat{Z}_1'\hat{Z}_1)^{-1}$ where $\hat{u}_i$ denotes the $i$-th 2SLS residual. Wald type statistics
for $H_0$: $R\delta_1 = r$ based on 2SLS estimates of $\delta_1$ can be obtained as in equation (7.41) with
$\hat{\delta}_{1,2SLS}$ replacing $\hat{\beta}_{OLS}$ and var$(\hat{\delta}_{1,2SLS}) = s_{11}(\hat{Z}_1'\hat{Z}_1)^{-1}$ replacing var$(\hat{\beta}_{OLS}) = s^2(X'X)^{-1}$.
This can be made robust for heteroskedasticity by using the robust variance covariance matrix
of $\hat{\delta}_{1,2SLS}$ described above. The resulting Wald statistic is asymptotically distributed as $\chi_q^2$
under the null hypothesis, with $q$ being the number of restrictions imposed by $R\delta_1 = r$.
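The distinction between $s_{11}$ and the second-stage mean squared error is a common source of wrong standard errors when 2SLS is run by hand. The sketch below, on simulated data with assumed parameter values and a hypothetical function name `two_sls_with_se`, computes both quantities together with conventional and heteroskedasticity-robust standard errors.

```python
import numpy as np


def two_sls_with_se(y1, Z1, X):
    """2SLS estimates with conventional and robust standard errors."""
    T, p = Z1.shape                                        # p = g1 + k1
    Z1_hat = X @ np.linalg.lstsq(X, Z1, rcond=None)[0]     # P_X Z1
    A_inv = np.linalg.inv(Z1_hat.T @ Z1_hat)               # (Z1' P_X Z1)^{-1}
    delta = A_inv @ (Z1_hat.T @ y1)
    u_hat = y1 - Z1 @ delta            # residuals of the ORIGINAL equation (11.34)
    s11 = u_hat @ u_hat / (T - p)
    cov = s11 * A_inv
    meat = (Z1_hat * u_hat[:, None] ** 2).T @ Z1_hat       # Z1_hat' diag(u_hat^2) Z1_hat
    cov_robust = A_inv @ meat @ A_inv
    w_hat = y1 - Z1_hat @ delta        # second-stage residuals, NOT used for s11
    s2_second_stage = w_hat @ w_hat / (T - p)
    return delta, np.sqrt(np.diag(cov)), np.sqrt(np.diag(cov_robust)), s11, s2_second_stage


rng = np.random.default_rng(7)
T = 400
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
X1 = X[:, :2]
u1 = rng.normal(size=T)
y2 = X @ np.array([1.0, 0.5, 1.0, -1.0]) + 0.8 * u1 + rng.normal(size=T)
y1 = 0.7 * y2 + X1 @ np.array([2.0, -1.0]) + u1
Z1 = np.column_stack([y2, X1])

delta, se, se_robust, s11, s2_second_stage = two_sls_with_se(y1, Z1, X)
print(delta, se, se_robust)
print(s11, s2_second_stage)
```

In this design the second-stage mean squared error is far from $s_{11}$, so standard errors copied from a manual second-stage OLS output would be misleading.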

LM type tests for exclusion restrictions, like a subset of $\delta_1$ set equal to zero, can be performed
by running the restricted 2SLS residuals on the matrix of unrestricted second stage regressors $\hat{Z}_1$.
The test statistic is given by $TR_u^2$ where $R_u^2$ denotes the uncentered $R^2$. This is asymptotically
distributed as $\chi_q^2$ under the null hypothesis, where $q$ is the number of coefficients in $\delta_1$ set equal
to zero. Note that it does not matter whether the exclusion restrictions are imposed on $\beta_1$ or
$\alpha_1$, i.e., whether the excluded variables to be tested are endogenous or exogenous. An F-test for
these exclusion restrictions can be constructed based on the restricted and unrestricted residual
sums of squares from the second stage regression. The denominator of this F-statistic, however,
is based on the unrestricted 2SLS residual sum of squares as reported by the 2SLS package. Of
course, one has to adjust the numerator and denominator by the appropriate degrees of freedom.
Under the null, this is asymptotically distributed as $F(q, T - (g_1 + k_1))$. See Wooldridge (1990)
for details. Also, see the over-identification test in Section 11.5.

Finite sample properties of 2SLS are model specific, see Mariano (2001) for a useful summary.
One important result is that the absolute moments of positive order for 2SLS are finite up to
the order of over-identification. So, for the 2SLS estimator to have a mean and variance, we
need the degree of over-identification to be at least 2. This also means that for a just-identified
model, no moments for 2SLS exist. For 2SLS, the absolute bias is an increasing function of the
degree of over-identification. For the case of one right hand side included endogenous regressor,
like equation (11.25), the size of OLS bias relative to 2SLS gets larger, the lower the degree of
over-identification, the bigger the sample size, the higher the absolute value of the correlation
between the disturbances and the endogenous regressor $y_2$, and the higher the concentration
parameter $\mu^2$. The latter is defined as $\mu^2 = E(y_2)'(P_X - P_{X_1})E(y_2)/\omega^2$ and $\omega^2 = \text{var}(y_{2t})$. In
terms of MSE, larger values of $\mu^2$ and large sample size favor 2SLS over OLS.

Another  important  single  equation  estimator  is  the  Limited  Information  Maximum  Likelihood 
(LIML)  estimator  which  as  the  name  suggests  maximizes  the  likelihood  function  pertaining  to 
the  endogenous  variables  appearing  in  the  estimated  equation  only.  Excluded  exogenous  vari¬ 
ables  from  this  equation  as  well  as  the  identifiability  restrictions  on  other  equations  in  the 
system  are  disregarded  in  the  likelihood  maximization.  For  details,  see  Anderson  and  Rubin 
(1950).  LIML  is  invariant  to  the  normalization  choice  of  the  dependent  variable  whereas  2SLS 
is  not.  This  invariancy  of  LIML  is  in  the  spirit  of  a  simultaneous  equation  model  where  nor¬ 
malization  should  not  matter.  Under  just-identification  2SLS  and  LIML  are  equivalent.  LIML 
is  also  known  as  the  Least  Variance  Ratio  (LVR)  method,  since  the  LIML  estimates  can  be 
obtained  by  minimizing  a  ratio  of  two  variances  or  equivalently  the  ratio  of  two  residual  sum  of 
squares.  Using  equation  (11.34),  one  can  write 

$$\tilde{y}_1 = y_1 - Y_1\alpha_1 = X_1\beta_1 + u_1$$

For a choice of $\alpha_1$ one can compute $\tilde{y}_1$ and regress it on $X_1$ to get the residual sum of squares
$RSS_1$. Now regress $\tilde{y}_1$ on $X_1$ and $X_2$ and compute the residual sum of squares $RSS_2$. Equation
(11.34) states that $X_2$ does not enter the specification of that equation. In fact, this is where
our identifying restrictions come from and the excluded exogenous variables that are used as
instrumental variables. If these identifying restrictions are true, adding $X_2$ to the regression
of $\tilde{y}_1$ on $X_1$ should lead to minimal reduction in $RSS_1$. Therefore, the LVR method finds
the $\alpha_1$ that will minimize the ratio $(RSS_1/RSS_2)$. After $\alpha_1$ is estimated, $\beta_1$ is obtained from
regressing $\tilde{y}_1$ on $X_1$. In contrast, it can be shown that 2SLS minimizes $RSS_1 - RSS_2$. For
details, see Johnston (1984) or Mariano (2001). Estimator bias is less of a problem for LIML
than 2SLS. In fact, as the number of instruments increases with the sample size such that their
ratio is a constant, Bekker (1994) shows that 2SLS becomes inconsistent while LIML remains
consistent. Both estimators are special cases of the following estimator:

$$\hat{\delta}_1 = [Z_1'P_XZ_1 - \hat{\theta}Z_1'Z_1]^{-1}[Z_1'P_Xy_1 - \hat{\theta}Z_1'y_1]$$

with $\hat{\theta} = 0$ yielding 2SLS, and $\hat{\theta}$ = the smallest eigenvalue of $\{(D_1'D_1)^{-1}D_1'P_XD_1\}$ yielding
LIML, where $D_1 = [y_1, Z_1]$.
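
The general estimator above can be sketched numerically. The following is a minimal illustration on simulated data, not the book's own computation: the function names `proj`, `k_class` and `liml_theta`, the sample size and all parameter values are assumptions made for this example. It evaluates the formula with $\hat{\theta} = 0$ (2SLS) and with $\hat{\theta}$ set to the smallest eigenvalue of $(D_1'D_1)^{-1}D_1'P_XD_1$ (LIML).

```python
import numpy as np

def proj(A):
    """Projection matrix onto the column space of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

def k_class(y1, Z1, X, theta):
    """[Z1'P_X Z1 - theta*Z1'Z1]^{-1} [Z1'P_X y1 - theta*Z1'y1]."""
    PX = proj(X)
    lhs = Z1.T @ PX @ Z1 - theta * (Z1.T @ Z1)
    rhs = Z1.T @ PX @ y1 - theta * (Z1.T @ y1)
    return np.linalg.solve(lhs, rhs)

def liml_theta(y1, Z1, X):
    """Smallest eigenvalue of (D1'D1)^{-1} D1'P_X D1, with D1 = [y1, Z1]."""
    D1 = np.column_stack([y1, Z1])
    M = np.linalg.solve(D1.T @ D1, D1.T @ proj(X) @ D1)
    return float(np.min(np.linalg.eigvals(M).real))

# Simulated data; every number below is an illustrative assumption.
rng = np.random.default_rng(0)
T = 500
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])  # included exogenous
X2 = rng.normal(size=(T, 3))                            # excluded exogenous
X = np.column_stack([X1, X2])
v = rng.normal(size=T)
u1 = 0.8 * v + 0.6 * rng.normal(size=T)                 # correlated with Y1
Y1 = X1 @ np.array([0.2, 0.3]) + X2 @ np.array([1.0, 0.5, -0.5]) + v
y1 = 0.5 * Y1 + X1 @ np.array([1.0, -1.0]) + u1
Z1 = np.column_stack([Y1, X1])

delta_2sls = k_class(y1, Z1, X, 0.0)
theta_liml = liml_theta(y1, Z1, X)
delta_liml = k_class(y1, Z1, X, theta_liml)
print("2SLS:", delta_2sls)
print("LIML:", delta_liml, "theta =", theta_liml)
```

If only one column of `X2` is kept, the equation becomes just-identified, the smallest eigenvalue is zero up to rounding, and the two estimates coincide, in line with the equivalence noted above.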

Example  3:  Simple  Keynesian  Model 

For the data from the Economic Report of the President, given in Table 5.3, consider the simple
Keynesian model with no government

$$C_t = \alpha + \beta Y_t + u_t \qquad t = 1, 2, \ldots, T$$

with $Y_t = C_t + I_t$.
Table 11.1  Two-Stage Least Squares

Dependent Variable: CONSUMP
Method: Two-Stage Least Squares
Sample: 1959 2007
Included observations: 49
Instrument specification: INV
Constant added to instrument list

Variable    Coefficient    Std. Error    t-Statistic    Prob.
C           4081.653       3194.839      1.277577       0.2077
Y           0.685609       0.172415      3.976513       0.0002

R-squared             0.904339    Mean dependent var    16749.10
Adjusted R-squared    0.902304    S.D. dependent var    5447.060
S.E. of regression    1702.552    Sum squared resid     1.36E+08
F-statistic           15.81265    Durbin-Watson stat    0.014554
Prob(F-statistic)     0.000240    Second-Stage SSR      1.38E+09
J-statistic           0.000000    Instrument rank       2

The OLS estimates of the consumption function yield:

$$C_t = \underset{(219.56)}{-1343.31} + \underset{(0.011)}{0.979}\,Y_t + \text{residuals}$$

The 2SLS estimates, assuming that $I_t$ is exogenous and is the only instrument available, yield

$$C_t = \underset{(3194.8)}{4081.65} + \underset{(0.172)}{0.686}\,Y_t + \text{residuals}$$

Table 11.1 reports these 2SLS results using EViews. Note that the OLS estimate of the intercept
is understated, while that of the slope estimate is overstated, indicating positive correlation
between $Y_t$ and the error as described in (11.6). The standard errors of 2SLS are bigger than
those of OLS. This is always the case for an instrumental variable estimator as will be shown
analytically for a simple regression in Example 4 below.

OLS on the reduced form equations yields

$$C_t = \underset{(3110.4)}{12982.72} + \underset{(1.74)}{2.18}\,I_t + \text{residuals} \quad \text{and} \quad Y_t = \underset{(3110.4)}{12982.72} + \underset{(1.74)}{3.18}\,I_t + \text{residuals}$$

From example (A.5) in the Appendix, we see that $\hat{\beta} = \hat{\pi}_{12}/\hat{\pi}_{22} = 2.18/3.18 = 0.686$ as described
in (A.24). Also, $\hat{\beta} = (\hat{\pi}_{22} - 1)/\hat{\pi}_{22} = (3.18 - 1)/3.18 = 2.18/3.18 = 0.686$ as described in (A.25).
Similarly, $\hat{\alpha} = \hat{\pi}_{11}/\hat{\pi}_{22} = \hat{\pi}_{21}/\hat{\pi}_{22} = 12982.72/3.18 = 4081.65$ as described in (A.22).

This confirms that under just-identification, the 2SLS estimates of the structural coefficients
are identical to the Indirect Least Squares (ILS) estimates. The latter estimates uniquely solve
for the structural parameter estimates from the reduced form estimates under just-identification.
Note that in this case both 2SLS and ILS estimates of the consumption equation are identical
to the simple IV estimator using $I_t$ as an instrument for $Y_t$; i.e., $\hat{\beta}_{IV} = m_{ci}/m_{yi}$ as shown in
(A.24).
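
The equality of ILS and simple IV under just-identification can be checked numerically. The sketch below uses simulated data rather than the Economic Report of the President series of Table 5.3; the structural values `alpha` and `beta`, the investment range and the helper `ols` are all assumptions introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 49
alpha, beta = 100.0, 0.7           # hypothetical structural parameters
I = rng.uniform(50, 150, size=T)   # investment, treated as exogenous
u = rng.normal(scale=5.0, size=T)
Y = (alpha + I + u) / (1 - beta)   # reduced form for income
C = Y - I                          # identity Y = C + I

def ols(y, x):
    """Intercept and slope from regressing y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

pi11, pi12 = ols(C, I)             # reduced form for C
pi21, pi22 = ols(Y, I)             # reduced form for Y

# Indirect least squares: solve for the structural parameters.
beta_ils = pi12 / pi22
alpha_ils = pi11 / pi22

# Simple IV using I as the instrument for Y.
m_ci = np.sum((C - C.mean()) * (I - I.mean()))
m_yi = np.sum((Y - Y.mean()) * (I - I.mean()))
beta_iv = m_ci / m_yi
alpha_iv = C.mean() - beta_iv * Y.mean()

print(beta_ils, beta_iv)
print(alpha_ils, alpha_iv)
```

Because the identity $Y_t = C_t + I_t$ holds exactly in the sample, the reduced form intercepts are equal and the slope for $Y_t$ exceeds that for $C_t$ by exactly one, which is why the two ways of computing $\hat{\beta}$ in the text agree.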
11.2.1  Spatial  Lag  Dependence

An alternative popular model for spatial lag dependence considered in Section 9.9 is given by:

$$y = \rho Wy + X\beta + \epsilon$$

where $\epsilon \sim \text{IIN}(0, \sigma^2)$, see Anselin (1988). Here $y_i$ may denote output in region $i$ which is affected
by output of its neighbors through the spatial coefficient $\rho$ and the weight matrix $W$. Recall
from Section 9.9 that $W$ is a known weight matrix with zero elements along its diagonal. It could
be a contiguity matrix having elements 1 if it is a neighboring region and zero otherwise. Usually
this is normalized such that each row sums to 1. Alternatively, $W$ could be based on distances
from neighbors, again normalized such that each row sums to 1. It is clear that the presence of
$Wy$ as a regressor introduces endogeneity. Assuming $(I_n - \rho W)$ nonsingular, one can solve for
the reduced form model:

$$y = (I_n - \rho W)^{-1}X\beta + \epsilon^*$$

where $\epsilon^* = (I_n - \rho W)^{-1}\epsilon$ has mean zero and variance-covariance matrix which has the same
form as (9.38), i.e.,

$$\Sigma = E(\epsilon^*\epsilon^{*\prime}) = \sigma^2\Omega = \sigma^2(I_n - \rho W)^{-1}(I_n - \rho W')^{-1}$$

For $|\rho| < 1$, one obtains

$$(I_n - \rho W)^{-1} = I_n + \rho W + \rho^2W^2 + \rho^3W^3 + \ldots$$

Hence

$$E(y/X) = (I_n - \rho W)^{-1}X\beta = X\beta + \rho WX\beta + \rho^2W^2X\beta + \rho^3W^3X\beta + \ldots$$

This also means that

$$E(Wy/X) = W(I_n - \rho W)^{-1}X\beta = WX\beta + \rho W^2X\beta + \rho^2W^3X\beta + \rho^3W^4X\beta + \ldots$$

Based on this last expression, Kelejian and Robinson (1993) and Kelejian and Prucha (1998)
suggest the use of a subset of the following instrumental variables:

$$\{X, WX, W^2X, W^3X, W^4X, \ldots\}$$

Lee (2003) suggested using the optimal instrument matrix:

$$\{X, W(I_n - \hat{\rho}W)^{-1}X\hat{\beta}\}$$

where the values for $\hat{\rho}$ and $\hat{\beta}$ are obtained from a first stage IV estimator, using $\{X, WX\}$
as instruments, possibly augmented with $W^2X$. Note that Lee's (2003) instruments involve
inverting a matrix of dimension $n$. Kelejian, et al. (2004) suggest an approximation based upon:

$$\{X, \sum_{s=0}^{r} \hat{\rho}^sW^{s+1}X\hat{\beta}\}$$

where $r$, the highest order of this approximation, depends upon the sample size, with $r = o(n^{1/2})$.
In their Monte Carlo experiments, they set $r = n^c$ where $c$ = 0.25, 0.35, and 0.45. This is a
natural application of 2SLS to deal with the problem of spatial lag dependence.
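
As a rough illustration of this use of 2SLS, the sketch below simulates a spatial lag model and estimates it with the instrument set $\{X, WX, W^2X\}$. The "circular" neighbor structure, the sample size and the parameter values are assumptions chosen only to keep the example small; it does not implement the Lee (2003) or Kelejian, et al. (2004) instruments.

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho = 200, 0.4
beta = np.array([1.0, 2.0])

# Hypothetical contiguity matrix: each region neighbors the two adjacent
# regions on a circle; zero diagonal, rows normalized to sum to 1.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 1.0
    W[i, (i + 1) % n] = 1.0
W /= W.sum(axis=1, keepdims=True)

X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(size=n)
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)   # reduced form

# The series I + rho*W + rho^2*W^2 + ... approximates (I - rho*W)^{-1}.
series = sum(rho**s * np.linalg.matrix_power(W, s) for s in range(60))

# Instruments {X, WX, W^2 X}. The constant is dropped from WX and W^2X
# because W times a column of ones is again a column of ones here.
x = X[:, 1:]
H = np.column_stack([X, W @ x, W @ W @ x])
Z = np.column_stack([W @ y, X])                  # regressors [Wy, X]

Z_hat = H @ np.linalg.lstsq(H, Z, rcond=None)[0]     # first stage
coef = np.linalg.lstsq(Z_hat, y, rcond=None)[0]      # (rho, beta) by 2SLS
print("rho, beta estimates:", coef)
```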
11.3  System  Estimation:  Three-Stage  Least  Squares

If  the  entire  simultaneous  equations  model  is  to  be  estimated,  then  one  should  consider  system 
estimators  rather  than  single  equation  estimators.  System  estimators  take  into  account  the  zero 
restrictions  in  every  equation  as  well  as  the  variance-covariance  matrix  of  the  disturbances  of 
the  whole  system.  One  such  system  estimator  is  Three-Stage  Least  Squares  (3SLS)  where  the 
structural  equations  are  stacked  on  top  of  each  other,  just  like  a  set  of  SUR  equations, 

$$y = Z\delta + u \tag{11.43}$$

where $y' = (y_1', \ldots, y_G')$, $Z = \text{diag}[Z_i]$, $\delta' = (\delta_1', \ldots, \delta_G')$, $u' = (u_1', \ldots, u_G')$,
and $u$ has zero mean and variance-covariance matrix $\Sigma \otimes I_T$, indicating the possible correlation
among the disturbances of the different structural equations. $\Sigma = [\sigma_{ij}]$, with $E(u_iu_j') = \sigma_{ij}I_T$,
for $i, j = 1, 2, \ldots, G$. This $\otimes$ notation was used in Chapter 9 and defined in the Appendix
to Chapter 7. Problem 4 shows that premultiplying the $i$-th structural equation by $X'$ and
performing GLS on the transformed equation results in 2SLS. For the system given in (11.43),
the analogy is obtained by premultiplying by $(I_G \otimes X')$, i.e., each equation by $X'$, and performing
GLS on the whole system. The transformed error $(I_G \otimes X')u$ has a zero mean and variance-covariance
matrix $\Sigma \otimes (X'X)$. Hence, GLS on the entire system obtains

$$\hat{\delta}_{GLS} = \{Z'(I_G \otimes X)[\Sigma^{-1} \otimes (X'X)^{-1}](I_G \otimes X')Z\}^{-1}\{Z'(I_G \otimes X)[\Sigma^{-1} \otimes (X'X)^{-1}](I_G \otimes X')y\} \tag{11.44}$$

which upon simplifying yields

$$\hat{\delta}_{GLS} = \{Z'[\Sigma^{-1} \otimes P_X]Z\}^{-1}\{Z'[\Sigma^{-1} \otimes P_X]y\} \tag{11.45}$$

$\Sigma$ has to be estimated to make this estimator operational. Zellner and Theil (1962) suggest
getting the 2SLS residuals for the $i$-th equation, say $\hat{u}_i = y_i - Z_i\hat{\delta}_{i,2SLS}$, and estimating $\Sigma$ by
$\hat{\Sigma} = [\hat{\sigma}_{ij}]$ where

$$\hat{\sigma}_{ij} = \hat{u}_i'\hat{u}_j/[(T - g_i - k_i)^{1/2}(T - g_j - k_j)^{1/2}] \quad \text{for } i, j = 1, 2, \ldots, G.$$

If $\hat{\Sigma}$ is substituted for $\Sigma$ in (11.45), the resulting estimator is called 3SLS:

$$\hat{\delta}_{3SLS} = \{Z'[\hat{\Sigma}^{-1} \otimes P_X]Z\}^{-1}\{Z'[\hat{\Sigma}^{-1} \otimes P_X]y\} \tag{11.46}$$

The asymptotic variance-covariance matrix of $\hat{\delta}_{3SLS}$ can be estimated by $\{Z'[\hat{\Sigma}^{-1} \otimes P_X]Z\}^{-1}$.

If  the  system  of  equations  (11.43)  is  properly  specified,  3SLS  is  more  efficient  than  2SLS.  But 
if  say,  the  second  equation  is  improperly  specified  while  the  first  equation  is  properly  specified, 
then  a  system  estimator  like  3SLS  will  be  contaminated  by  this  misspecification  whereas  a  single 
equation  estimator  like  2SLS  on  the  first  equation  is  not.  So,  if  the  first  equation  is  of  interest 
it  does  not  pay  to  go  to  a  system  estimator  in  this  case. 
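
The three stages in (11.43) to (11.46) can be sketched as follows. This is a small illustration on a simulated two-equation system, written for readability rather than efficiency (it forms the full $\hat{\Sigma}^{-1} \otimes P_X$ matrix); the function names, the data generating process and all parameter values are assumptions made for this example.

```python
import numpy as np

def proj(A):
    """Projection matrix onto the column space of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

def block_diag(mats):
    """Stack matrices along the diagonal, zeros elsewhere."""
    out = np.zeros((sum(m.shape[0] for m in mats), sum(m.shape[1] for m in mats)))
    r = c = 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r, c = r + m.shape[0], c + m.shape[1]
    return out

def three_sls(ys, Zs, X):
    """Return (3SLS estimate, its estimated covariance, list of 2SLS estimates)."""
    T, G = len(ys[0]), len(ys)
    PX = proj(X)
    # 2SLS equation by equation, then residuals.
    d2 = [np.linalg.solve(Z.T @ PX @ Z, Z.T @ PX @ y) for y, Z in zip(ys, Zs)]
    res = [y - Z @ d for y, Z, d in zip(ys, Zs, d2)]
    dof = [T - Z.shape[1] for Z in Zs]               # T - g_i - k_i
    S = np.array([[res[i] @ res[j] / np.sqrt(dof[i] * dof[j])
                   for j in range(G)] for i in range(G)])
    # GLS on the stacked system with Sigma replaced by its estimate.
    Z, y = block_diag(Zs), np.concatenate(ys)
    Om = np.kron(np.linalg.inv(S), PX)
    V = np.linalg.inv(Z.T @ Om @ Z)
    return V @ (Z.T @ Om @ y), V, d2

# Simulated system (all values are illustrative assumptions):
#   y1 =  0.5*y2 + 1 + x1      + u1   (over-identified)
#   y2 = -0.3*y1 + 2 + x2 + x3 + u2   (just-identified)
rng = np.random.default_rng(3)
T = 200
x1, x2, x3 = rng.normal(size=(3, T))
const = np.ones(T)
U = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=T)
B = np.array([[1.0, -0.5], [0.3, 1.0]])
rhs = np.column_stack([1 + x1 + U[:, 0], 2 + x2 + x3 + U[:, 1]])
Yend = np.linalg.solve(B, rhs.T).T
y1, y2 = Yend[:, 0], Yend[:, 1]

X = np.column_stack([const, x1, x2, x3])
ys = [y1, y2]
Zs = [np.column_stack([y2, const, x1]), np.column_stack([y1, const, x2, x3])]
d3, V3, d2 = three_sls(ys, Zs, X)
print("2SLS:", np.concatenate(d2))
print("3SLS:", d3)
```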
Two sufficient conditions exist for the equivalence of 2SLS and 3SLS. These are the following:
(i) $\Sigma$ is diagonal, and (ii) every equation is just identified. Problem 5 leads you step by step
through these results. It is also easy to show, see problem 5, that a necessary and sufficient
condition for 3SLS to be equivalent to 2SLS on each equation is given by

$$\sigma^{ij}\hat{Z}_i'\bar{P}_{\hat{Z}_j} = 0 \quad \text{for } i, j = 1, 2, \ldots, G$$

where $\hat{Z}_i = P_XZ_i$, see Baltagi (1989). This is similar to the condition derived in the seemingly
unrelated regressions case except it involves the set of second stage regressors of 2SLS. One can
easily see that besides the two sufficient conditions given above, $\hat{Z}_i'\bar{P}_{\hat{Z}_j} = 0$ states that the set
of second stage regressors of the $i$-th equation have to be a perfect linear combination of those
in the $j$-th equation and vice versa. A similar condition was derived by Kapteyn and Fiebig
(1981). If some equations in the system are over-identified while others are just-identified, the
3SLS estimates of the over-identified equations can be obtained by running 3SLS ignoring the
just-identified equations. The 3SLS estimates of each just-identified equation differ from those of
2SLS by a vector which is a linear function of the 3SLS residuals of the over-identified equations,
see Theil (1971) and problem 17.


11.4  Test  for  Over-Identification  Restrictions 


We emphasized instrument relevance; now we turn to instrument exogeneity. Under just-identification,
one cannot statistically test instruments for exogeneity. This choice of exogenous
instruments requires making an expert judgement based on knowledge of the empirical application.
However, if the first structural equation is over-identified, i.e., the number of instruments
$\ell$ is larger than the number of right hand side variables $(g_1 + k_1)$, then one can test these
over-identifying restrictions. A likelihood ratio test for this over-identification condition based on
maximum likelihood procedures was given by Anderson and Rubin (1950). This version of the
test requires the computation of LIML. This was later modified by Basmann (1960) so that it
could be based on the 2SLS procedure. Here we present a simpler alternative based on Davidson
and MacKinnon (1993) and Hausman (1983). In essence, one is testing

$$H_0: y_1 = Z_1\delta_1 + u_1 \quad \text{versus} \quad H_1: y_1 = Z_1\delta_1 + W^*\gamma + u_1 \tag{11.47}$$

where $u_1 \sim \text{IID}(0, \sigma_{11}I_T)$. Let $W$ be the matrix of instruments of full rank $\ell$. Also, let $W^*$ be a
subset of instruments $W$, of dimension $(\ell - k_1 - g_1)$, that are linearly independent of $\hat{Z}_1 = P_WZ_1$.
In this case, the matrix $[\hat{Z}_1, W^*]$ has full rank $\ell$ and therefore spans the same space as $W$. A
test for over-identification is a test for $\gamma = 0$. In other words, $W^*$ has no ability to explain any
variation in $y_1$ that is not explained by $Z_1$ using the matrix of instruments $W$.

If $W^*$ is correlated with $u_1$ or the first structural equation (11.34) is misspecified, say by $Z_1$
not including some variables in $W^*$, then $\gamma \neq 0$. Hence, testing $\gamma = 0$ should be interpreted as a
joint test for the validity of the matrix of instruments $W$ and the proper specification of (11.34),
see Davidson and MacKinnon (1993). Testing $H_0: \gamma = 0$ can be obtained as an asymptotic
F-test as follows:

$$\frac{(RRSS^* - URSS^*)/(\ell - (g_1 + k_1))}{URSS/(T - \ell)} \tag{11.48}$$
This is asymptotically distributed as $F(\ell - (g_1 + k_1), T - \ell)$ under $H_0$. Using instruments $W$,
we regress $Z_1$ on $W$ and get $\hat{Z}_1$, then obtain the restricted 2SLS estimate $\tilde{\delta}_{1,2SLS}$ by regressing
$y_1$ on $\hat{Z}_1$. The restricted residual sum of squares from the second stage regression is $RRSS^* =
(y_1 - \hat{Z}_1\tilde{\delta}_{1,2SLS})'(y_1 - \hat{Z}_1\tilde{\delta}_{1,2SLS})$. Next we regress $y_1$ on $\hat{Z}_1$ and $W^*$ to get the unrestricted 2SLS
estimates $\hat{\delta}_{1,2SLS}$ and $\hat{\gamma}_{2SLS}$. The unrestricted residual sum of squares from the second stage
regression is $URSS^* = (y_1 - \hat{Z}_1\hat{\delta}_{1,2SLS} - W^*\hat{\gamma}_{2SLS})'(y_1 - \hat{Z}_1\hat{\delta}_{1,2SLS} - W^*\hat{\gamma}_{2SLS})$. The $URSS$
in (11.48) is the 2SLS residual sum of squares from the unrestricted model, which is obtained
as follows: $(y_1 - Z_1\hat{\delta}_{1,2SLS} - W^*\hat{\gamma}_{2SLS})'(y_1 - Z_1\hat{\delta}_{1,2SLS} - W^*\hat{\gamma}_{2SLS})$. $URSS$ differs from $URSS^*$
in that $Z_1$ rather than $\hat{Z}_1$ is used in obtaining the residuals. Note that this differs from the
Chow-test in that the denominator is not based on $URSS^*$, see Wooldridge (1990).

This test does not require the construction of $W^*$ for its implementation. This is because the
model under $H_1$ is just-identified with as many regressors as there are instruments. This means
that its

$$URSS^* = y_1'\bar{P}_Wy_1 = y_1'y_1 - y_1'P_Wy_1$$

see problem 10. It is easy to show, see problem 12, that

$$RRSS^* = y_1'\bar{P}_{\hat{Z}_1}y_1 = y_1'y_1 - y_1'P_{\hat{Z}_1}y_1$$

where $\hat{Z}_1 = P_WZ_1$. Hence,

$$RRSS^* - URSS^* = y_1'P_Wy_1 - y_1'P_{\hat{Z}_1}y_1 \tag{11.49}$$

The test for over-identification can therefore be based on $RRSS^* - URSS^*$ divided by a consistent
estimate of $\sigma_{11}$, say,

$$\tilde{\sigma}_{11} = (y_1 - Z_1\tilde{\delta}_{1,2SLS})'(y_1 - Z_1\tilde{\delta}_{1,2SLS})/T \tag{11.50}$$

Problem 12 shows that the resulting test statistic is exactly that proposed by Hausman (1983).
In a nutshell, the Hausman over-identification test regresses the 2SLS residuals $y_1 - Z_1\tilde{\delta}_{1,2SLS}$
on the matrix $W$ of all pre-determined variables in the model. The test statistic is $T$ times the
uncentered $R^2$ of this regression. See the Appendix to Chapter 3 for a definition of uncentered
$R^2$. This test statistic is asymptotically distributed as $\chi^2$ with $\ell - (g_1 + k_1)$ degrees of freedom.
Large values of this statistic reject the null hypothesis.
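
The $T \times$ uncentered $R^2$ form of the test is simple to compute. The sketch below is an illustration on simulated data with valid instruments; the function name `overid_test` and all data generating values are assumptions made for this example.

```python
import numpy as np

def overid_test(y1, Z1, W):
    """T times the uncentered R^2 from regressing 2SLS residuals on W.

    Returns the statistic and its degrees of freedom l - (g1 + k1).
    """
    T = len(y1)
    Z1_hat = W @ np.linalg.lstsq(W, Z1, rcond=None)[0]    # first stage
    delta = np.linalg.lstsq(Z1_hat, y1, rcond=None)[0]    # 2SLS estimate
    e = y1 - Z1 @ delta              # residuals use Z1, not Z1_hat
    e_fit = W @ np.linalg.lstsq(W, e, rcond=None)[0]
    stat = T * (e_fit @ e_fit) / (e @ e)
    return stat, W.shape[1] - Z1.shape[1]

# Simulated data (illustrative assumptions): one endogenous regressor,
# a constant, and three excluded instruments, so two over-identifying
# restrictions.
rng = np.random.default_rng(4)
T = 400
X1 = np.ones((T, 1))
W_excl = rng.normal(size=(T, 3))
W = np.column_stack([X1, W_excl])
v = rng.normal(size=T)
u1 = 0.7 * v + rng.normal(size=T)
Y1 = W_excl @ np.array([1.0, 1.0, 1.0]) + v
y1 = 2.0 + 0.5 * Y1 + u1
Z1 = np.column_stack([Y1, X1])

stat, df = overid_test(y1, Z1, W)
print("T * R2_u =", stat, "with", df, "degrees of freedom")
```

With two degrees of freedom, the statistic would be compared with the 5% critical value of the $\chi^2_2$ distribution, 5.99. When the equation is just-identified the 2SLS residuals are orthogonal to $W$ and the statistic is zero, which matches the J-statistic of 0.000000 reported in Table 11.1.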

Alternatively, one can get this test statistic as a Gauss-Newton Regression (GNR) on the
unrestricted model in (11.47). To see this, recall from section 8.4 that the GNR applies to a
general nonlinear model $y_t = x_t(\beta) + u_t$. Using the set of instruments $W$, the GNR becomes

$$y - x(\tilde{\beta}) = P_WX(\tilde{\beta})b + \text{residuals}$$

where $\tilde{\beta}$ denotes the restricted instrumental variable estimate of $\beta$ under the null hypothesis and
$X(\beta)$ is the matrix of derivatives with typical elements $X_{tj}(\beta) = \partial x_t(\beta)/\partial\beta_j$ for $j = 1, \ldots, k$.
Thus, the only difference between this GNR and that in Chapter 8 is that the regressors are
multiplied by $P_W$, see Davidson and MacKinnon (1993, p. 226). Therefore, the GNR for (11.47)
yields

$$y_1 - Z_1\tilde{\delta}_{1,2SLS} = \hat{Z}_1b_1 + W^*b_2 + \text{residuals} \tag{11.51}$$

since $P_W[Z_1, W^*] = [\hat{Z}_1, W^*]$ and $\tilde{\delta}_{1,2SLS}$ is the restricted estimator under $H_0: \gamma = 0$. But
$[\hat{Z}_1, W^*]$ spans the same space as $W$, see problem 12. Hence, the GNR in (11.51) is equivalent
to running the 2SLS residuals on $W$ and computing $T$ times the uncentered $R^2$ as described
above. Once again, it is clear that $W^*$ need not be constructed.

The basic intuition behind the test for over-identification restrictions rests on the fact that
one can compute several legitimate IV estimators if all these instruments are relevant and
exogenous. For example, suppose there are two instruments and one right hand side endogenous
variable. Then one can compute two IV estimators using each instrument separately. If these
IV estimators produce very different estimates, then maybe one instrument or the other or
both are not exogenous. The over-identification test we just described implicitly makes this
comparison without actually computing all possible IV estimates. Exogenous instruments have
to be uncorrelated with the disturbances. This suggests that the 2SLS residuals have to be
uncorrelated with the instruments. This is the basis for the $TR^2_u$ test statistic. If all the
instruments are exogenous, the regression coefficient estimates should all be not significantly
different from zero and the uncentered $R^2_u$ should be low.


11.5  Hausman’s  Specification  Test3 

A critical assumption for the linear regression model $y = X\beta + u$ is that the set of regressors
$X$ are uncorrelated with the error term $u$. Otherwise, we have simultaneous bias and OLS is
inconsistent. Hausman (1978) proposed a general specification test for $H_0: E(u/X) = 0$ versus
$H_1: E(u/X) \neq 0$. Two estimators are needed to implement this test. The first estimator must
be a consistent and efficient estimator of $\beta$ under $H_0$ which becomes inconsistent under $H_1$. Let
us denote this efficient estimator under $H_0$ by $\hat{\beta}_0$. The second estimator, denoted by $\hat{\beta}_1$, must
be consistent for $\beta$ under both $H_0$ and $H_1$, but inefficient under $H_0$. Hausman's test is based
on the difference between these two estimators $\hat{q} = \hat{\beta}_1 - \hat{\beta}_0$. Under $H_0$, plim $\hat{q}$ is zero, while
under $H_1$, plim $\hat{q} \neq 0$. Hausman (1978) shows that $\text{var}(\hat{q}) = \text{var}(\hat{\beta}_1) - \text{var}(\hat{\beta}_0)$ and Hausman's
test becomes

$$m = \hat{q}'[\text{var}(\hat{q})]^{-1}\hat{q} \tag{11.52}$$

which is asymptotically distributed under $H_0$ as $\chi^2_k$ where $k$ is the dimension of $\beta$.

It remains to show that $\text{var}(\hat{q})$ is the difference between the two variances. This can be
illustrated for a single regressor case without matrix algebra, see Maddala (1992, page 507).
First, one shows that $\text{cov}(\hat{\beta}_0, \hat{q}) = 0$. To prove this, consider a new estimator of $\beta$ defined as
$\tilde{\beta} = \hat{\beta}_0 + \lambda\hat{q}$ where $\lambda$ is an arbitrary constant. Under $H_0$, plim $\tilde{\beta} = \beta$ for every $\lambda$ and

$$\text{var}(\tilde{\beta}) = \text{var}(\hat{\beta}_0) + \lambda^2\text{var}(\hat{q}) + 2\lambda\,\text{cov}(\hat{\beta}_0, \hat{q})$$

Since $\hat{\beta}_0$ is efficient, $\text{var}(\tilde{\beta}) \geq \text{var}(\hat{\beta}_0)$ which means that $\lambda^2\text{var}(\hat{q}) + 2\lambda\,\text{cov}(\hat{\beta}_0, \hat{q}) \geq 0$ for every $\lambda$.
If $\text{cov}(\hat{\beta}_0, \hat{q}) > 0$, then for $\lambda = -\text{cov}(\hat{\beta}_0, \hat{q})/\text{var}(\hat{q})$ the above inequality is violated. Similarly, if
$\text{cov}(\hat{\beta}_0, \hat{q}) < 0$, then for $\lambda = -\text{cov}(\hat{\beta}_0, \hat{q})/\text{var}(\hat{q})$ the above inequality is violated. Therefore, under
$H_0$, for the above inequality to be satisfied for every $\lambda$, it must be the case that $\text{cov}(\hat{\beta}_0, \hat{q}) = 0$.

Now, $\hat{q} = \hat{\beta}_1 - \hat{\beta}_0$ can be rewritten as $\hat{\beta}_1 = \hat{q} + \hat{\beta}_0$ with $\text{var}(\hat{\beta}_1) = \text{var}(\hat{q}) + \text{var}(\hat{\beta}_0) + 2\text{cov}(\hat{q}, \hat{\beta}_0)$.
Using the fact that the last term is zero, we get the required result: $\text{var}(\hat{q}) = \text{var}(\hat{\beta}_1) - \text{var}(\hat{\beta}_0)$.
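Example 4: Consider the simple regression without a constant

$$y_t = \beta x_t + u_t \qquad t = 1, 2, \ldots, T \quad \text{with } u_t \sim \text{IIN}(0, \sigma^2)$$

and where $\beta$ is a scalar. Under $H_0$: $E(u_t/x_t) = 0$ and OLS is efficient and consistent. Under
$H_1$, OLS is not consistent. Using $w_t$ as an instrumental variable, with $w_t$ uncorrelated with $u_t$
and preferably highly correlated with $x_t$, yields the following IV estimator of $\beta$:

$$\hat{\beta}_{IV} = \sum_{t=1}^T y_tw_t \Big/ \sum_{t=1}^T x_tw_t = \beta + \sum_{t=1}^T w_tu_t \Big/ \sum_{t=1}^T x_tw_t$$

with plim $\hat{\beta}_{IV} = \beta$ under $H_0$ and $H_1$ and

$$\text{var}(\hat{\beta}_{IV}) = \sigma^2\sum_{t=1}^T w_t^2 \Big/ \Big(\sum_{t=1}^T x_tw_t\Big)^2 = \sigma^2 \Big/ \Big(r^2_{xw}\sum_{t=1}^T x_t^2\Big)$$

where $r^2_{xw} = (\sum_{t=1}^T x_tw_t)^2 / (\sum_{t=1}^T w_t^2\sum_{t=1}^T x_t^2)$. Also,

$$\hat{\beta}_{OLS} = \sum_{t=1}^T x_ty_t \Big/ \sum_{t=1}^T x_t^2 = \beta + \sum_{t=1}^T x_tu_t \Big/ \sum_{t=1}^T x_t^2$$

with plim $\hat{\beta}_{OLS} = \beta$ under $H_0$ but plim $\hat{\beta}_{OLS} \neq \beta$ under $H_1$, with $\text{var}(\hat{\beta}_{OLS}) = \sigma^2/\sum_{t=1}^T x_t^2$.

One can see that $\text{var}(\hat{\beta}_{OLS}) \leq \text{var}(\hat{\beta}_{IV})$ since $0 < r^2_{xw} \leq 1$. In fact, for a weak IV, $r^2_{xw}$ is small
and close to zero and $\text{var}(\hat{\beta}_{IV})$ blows up. A strong IV is one where $r^2_{xw}$ is close to 1, and the closer
$r^2_{xw}$ is to 1, the closer is $\text{var}(\hat{\beta}_{IV})$ to $\text{var}(\hat{\beta}_{OLS})$. In this case, $\hat{q} = \hat{\beta}_{IV} - \hat{\beta}_{OLS}$ and plim $\hat{q} \neq 0$ under $H_1$, while
plim $\hat{q} = 0$ under $H_0$, with

$$\text{var}(\hat{q}) = \text{var}(\hat{\beta}_{IV}) - \text{var}(\hat{\beta}_{OLS}) = \frac{\sigma^2}{\sum_{t=1}^T x_t^2}\left[\frac{1}{r^2_{xw}} - 1\right] = \text{var}(\hat{\beta}_{OLS})\left[\frac{1 - r^2_{xw}}{r^2_{xw}}\right]$$

Therefore, Hausman's test statistic is

$$m = \hat{q}^2r^2_{xw} / [\text{var}(\hat{\beta}_{OLS})(1 - r^2_{xw})]$$

which is asymptotically distributed as $\chi^2_1$ under $H_0$. Note that the same estimator of $\sigma^2$ is used
for $\text{var}(\hat{\beta}_{IV})$ and $\text{var}(\hat{\beta}_{OLS})$. This is the estimator of $\sigma^2$ obtained under $H_0$.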

The Hausman-test can also be obtained from the following augmented regression:

$$y_t = \beta x_t + \gamma\hat{x}_t + \epsilon_t$$

where $\hat{x}_t$ is the predicted value of $x_t$ from regressing it on the instrumental variable $w_t$. Problem
13 asks the reader to show that Hausman's test statistic can be obtained by testing $\gamma = 0$.
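
The two routes to Hausman's statistic in this example can be compared numerically. The sketch below simulates data under $H_0$ (all values are illustrative assumptions), computes $m$ from the formula, and then computes the Wald statistic for $\gamma = 0$ in the augmented regression using the same estimate of $\sigma^2$ obtained under $H_0$.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 300
w = rng.normal(size=T)                 # instrument
x = 0.7 * w + rng.normal(size=T)       # regressor, correlated with w
u = rng.normal(size=T)                 # H0 holds: u is unrelated to x
y = 1.5 * x + u

# OLS, IV and the contrast q.
b_ols = (x @ y) / (x @ x)
b_iv = (w @ y) / (w @ x)
q = b_iv - b_ols

# Hausman statistic from the closed form of this example.
r2 = (x @ w) ** 2 / ((w @ w) * (x @ x))
s2 = np.sum((y - b_ols * x) ** 2) / T  # sigma^2 estimated under H0
var_ols = s2 / (x @ x)
m = q**2 * r2 / (var_ols * (1 - r2))

# Augmented regression y = beta*x + gamma*x_hat + e, no constant.
x_hat = w * (w @ x) / (w @ w)
A = np.column_stack([x, x_hat])
coef = np.linalg.solve(A.T @ A, A.T @ y)
gamma = coef[1]
var_gamma = s2 * np.linalg.inv(A.T @ A)[1, 1]
wald = gamma**2 / var_gamma

print("m =", m, " Wald for gamma = 0:", wald)
```

In this setup $\hat{\gamma} = \hat{q}/(1 - r^2_{xw})$, so the two statistics agree exactly when the same $\sigma^2$ estimate is used in both.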

The IV estimator assumes that $w_t$ and $u_t$ are uncorrelated. If this is violated in practice, then
the IV estimator is inconsistent and the asymptotic bias can be aggravated if in addition this
is a weak instrument. To see this,

$$\text{plim}\,\hat{\beta}_{IV} = \beta + \text{plim}\Big(\sum_{t=1}^T w_tu_t \Big/ \sum_{t=1}^T x_tw_t\Big) = \beta + \frac{\text{cov}(w_t, u_t)}{\text{cov}(x_t, w_t)} = \beta + \frac{\text{corr}(w_t, u_t)}{\text{corr}(x_t, w_t)}\,\frac{\sigma_u}{\sigma_x}$$

where $\text{cov}(w_t, u_t)$ and $\text{corr}(x_t, w_t)$ denote the population covariance and correlation, respectively.
Also, $\sigma_x$ and $\sigma_u$ denote the population standard deviations. If $\text{corr}(w_t, u_t) \neq 0$, there is
an asymptotic bias in $\hat{\beta}_{IV}$. If $\text{corr}(w_t, u_t)$ is small and the instrument is strong (with a large
$\text{corr}(x_t, w_t)$), this bias could be small. However, this bias could be large, even if $\text{corr}(w_t, u_t)$ is
small, in the case of weak instruments (small $\text{corr}(x_t, w_t)$). This warns researchers against using an
instrument which they deem much better than $x_t$ because it is less correlated with $u_t$. If this
instrument is additionally weak, the bias of the resulting IV estimator will be enlarged due to
the smaller $\text{corr}(x_t, w_t)$. In sum, weak instruments may have large asymptotic bias in practice
even if they are slightly correlated with the error term.
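
A few numbers make the point concrete. The sketch below simply evaluates the asymptotic bias expression for hypothetical correlations; the function names and the chosen values are assumptions for illustration. For comparison it also evaluates the standard OLS result plim $\hat{\beta}_{OLS} - \beta = \text{cov}(x_t, u_t)/\text{var}(x_t)$, which is not derived in the text above.

```python
def iv_asymptotic_bias(corr_wu, corr_xw, sigma_u=1.0, sigma_x=1.0):
    """plim(beta_IV) - beta = [corr(w,u) / corr(x,w)] * (sigma_u / sigma_x)."""
    return (corr_wu / corr_xw) * (sigma_u / sigma_x)

def ols_asymptotic_bias(corr_xu, sigma_u=1.0, sigma_x=1.0):
    """plim(beta_OLS) - beta = cov(x,u) / var(x) = corr(x,u) * sigma_u / sigma_x."""
    return corr_xu * sigma_u / sigma_x

# Hypothetical values: x is fairly endogenous, w only slightly so.
ols_bias = ols_asymptotic_bias(0.30)
strong_iv = iv_asymptotic_bias(corr_wu=0.05, corr_xw=0.80)
weak_iv = iv_asymptotic_bias(corr_wu=0.05, corr_xw=0.05)

print("OLS bias:      ", ols_bias)
print("strong IV bias:", strong_iv)
print("weak IV bias:  ", weak_iv)
```

Under these assumed values the same slight correlation between the instrument and the error produces a small bias with a strong instrument but a bias larger than that of OLS with a weak one.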

In matrix form, the Durbin-Wu-Hausman test for the first structural equation is based upon
the difference between OLS and IV estimation of (11.34) using the matrix of instruments $W$.
In particular, the vector of contrasts is given by

$$\hat{q} = \hat{\delta}_{1,IV} - \hat{\delta}_{1,OLS} = (Z_1'P_WZ_1)^{-1}[Z_1'P_Wy_1 - (Z_1'P_WZ_1)(Z_1'Z_1)^{-1}Z_1'y_1] = (Z_1'P_WZ_1)^{-1}[Z_1'P_W\bar{P}_{Z_1}y_1] \tag{11.53}$$

Under the null hypothesis, $\hat{q} = (Z_1'P_WZ_1)^{-1}Z_1'P_W\bar{P}_{Z_1}u_1$. The test for $q = 0$ can be based on the
test for $Z_1'P_W\bar{P}_{Z_1}u_1$ having mean zero asymptotically. This last vector is of dimension $(g_1 + k_1)$.
However, not all of its elements are necessarily random variables since $\bar{P}_{Z_1}$ may annihilate some
columns of the second stage regressors $\hat{Z}_1 = P_WZ_1$. In fact, all the included $X$'s which are part
of $W$, i.e., $X_1$, will be annihilated by $\bar{P}_{Z_1}$. Only the $g_1$ linearly independent variables $\hat{Y}_1 = P_WY_1$
are not annihilated by $\bar{P}_{Z_1}$.

Our test focuses on the vector $\hat{Y}_1'\bar{P}_{Z_1}u_1$ having mean zero asymptotically. Now consider the
artificial regression

$$y_1 = Z_1\delta_1 + \hat{Y}_1\gamma + \text{residuals} \tag{11.54}$$

Since $[Z_1, \hat{Y}_1]$, $[Z_1, \hat{Z}_1]$, $[Z_1, Z_1 - \hat{Z}_1]$ and $[Z_1, Y_1 - \hat{Y}_1]$ all span the same column space, this
regression has the same residual sum of squares as

$$y_1 = Z_1\delta_1 + (Y_1 - \hat{Y}_1)\eta + \text{residuals} \tag{11.55}$$

The DWH-test may be based on either of these regressions. It is equivalent to testing $\gamma = 0$
in (11.54) or $\eta = 0$ in (11.55) using an F-test. This is asymptotically distributed as $F(g_1, T -
2g_1 - k_1)$. Davidson and MacKinnon (1993, p. 239) warn about interpreting this test as one of
exogeneity of $Y_1$ (the variables in $Z_1$ not in the space spanned by $W$). They argue that what
is being tested is the consistency of the OLS estimates of $\delta_1$, not that every column of $Z_1$ is
independent of $u_1$.
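
The artificial regressions (11.54) and (11.55) can be sketched as follows on simulated data. The function name `dwh_test` and the data generating process are assumptions made for this illustration; the restricted model is OLS of $y_1$ on $Z_1$, and the F-statistic uses $g_1$ and $T - 2g_1 - k_1$ degrees of freedom as stated above.

```python
import numpy as np

def dwh_test(y1, Y1, X1, W):
    """F-statistic for gamma = 0 in y1 = Z1*delta1 + Y1_hat*gamma + residuals.

    Also returns the residual sums of squares of (11.54) and (11.55),
    which should coincide. W is assumed to contain the columns of X1.
    """
    T, g1 = Y1.shape
    k1 = X1.shape[1]
    Z1 = np.column_stack([Y1, X1])
    Y1_hat = W @ np.linalg.lstsq(W, Y1, rcond=None)[0]

    def rss(A):
        e = y1 - A @ np.linalg.lstsq(A, y1, rcond=None)[0]
        return e @ e

    rrss = rss(Z1)                                       # OLS on (11.34)
    urss = rss(np.column_stack([Z1, Y1_hat]))            # regression (11.54)
    urss_alt = rss(np.column_stack([Z1, Y1 - Y1_hat]))   # regression (11.55)
    F = ((rrss - urss) / g1) / (urss / (T - 2 * g1 - k1))
    return F, urss, urss_alt

# Simulated data (illustrative assumptions): Y1 is endogenous because
# its reduced form error v also enters u1.
rng = np.random.default_rng(6)
T = 300
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = rng.normal(size=(T, 2))
W = np.column_stack([X1, X2])
v = rng.normal(size=T)
Y1 = (X2 @ np.array([1.0, -1.0]) + v).reshape(-1, 1)
u1 = 0.8 * v + rng.normal(size=T)
y1 = 0.5 * Y1[:, 0] + X1 @ np.array([1.0, 2.0]) + u1

F, urss, urss_alt = dwh_test(y1, Y1, X1, W)
print("DWH F-statistic:", F)
```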

In practice, one may be sure about using $W_2$ as a set of IV's but is not sure whether some
$r$ additional variables in $Z_1$ are legitimate as instruments. The DWH-test in this case will be
based upon the difference between two IV estimators for $\delta_1$. The first is $\hat{\delta}_{2,IV}$ based on $W_2$ and
the second is $\hat{\delta}_{1,IV}$ based on $W_1$. The latter set includes $W_2$ and the additional $r$ variables in $Z_1$.

$$\hat{\delta}_{2,IV} - \hat{\delta}_{1,IV} = (Z_1'P_{W_2}Z_1)^{-1}[Z_1'P_{W_2}y_1 - (Z_1'P_{W_2}Z_1)(Z_1'P_{W_1}Z_1)^{-1}Z_1'P_{W_1}y_1] = (Z_1'P_{W_2}Z_1)^{-1}Z_1'P_{W_2}(\bar{P}_{P_{W_1}Z_1})y_1 \tag{11.56}$$

since $P_{W_2}P_{W_1} = P_{W_2}$. The DWH-test is based on this contrast having mean zero asymptotically.
Once again this last vector has dimension $g_1 + k_1$ and not all its elements are necessarily
random variables since $\bar{P}_{P_{W_1}Z_1}$ annihilates some columns of $P_{W_2}Z_1$. This test can be based on
the following artificial regression:

$$y_1 = Z_1\delta_1 + P_{W_2}Z_1^*\gamma + \text{residuals} \tag{11.57}$$

where $P_{W_2}Z_1^*$ consists of the $r$ columns of $P_{W_2}Z_1$ that are not annihilated by $\bar{P}_{P_{W_1}Z_1}$. Regression
(11.57) is performed with $W_1$ as the set of IV's and $\gamma = 0$ is tested using an F-test.
11.6  Empirical  Examples


Example 1: Crime in North Carolina. Cornwell and Trumbull (1994) estimated an economic
model of crime using data on 90 counties in North Carolina observed over the years 1981-87.
This data set is available on the Springer web site as CRIME.DAT. Here, we consider cross-section
data for 1987 and reconsider the full panel data set in Chapter 12. Table 11.2 gives the
OLS estimates relating the crime rate (which is an FBI index measuring the number of crimes
divided by the county population) to a set of explanatory variables. This was done using Stata.
All variables are in logs except for the regional dummies. The explanatory variables consist of the
probability of arrest (which is measured by the ratio of arrests to offenses); the probability of conviction
given arrest (which is measured by the ratio of convictions to arrests); the probability of a prison
sentence given a conviction (measured by the proportion of total convictions resulting in prison
sentences); the average prison sentence in days as a proxy for sanction severity; the number of police
per capita as a measure of the county's ability to detect crime; the population density, which
is the county population divided by county land area; a dummy variable indicating whether
the county is in an SMSA with population larger than 50,000; percent minority, which is the


Table 11.2  Least Squares Estimates: Crime in North Carolina

Source      SS            df    MS              Number of obs  =  90
Model       22.8072483    20    1.14036241      F(20,69)       =  19.71
Residual    3.99245334    69    .057861643      Prob > F       =  0.0000
Total       26.7997016    89    .301120243      R-squared      =  0.8510
                                                Adj R-squared  =  0.8078
                                                Root MSE       =  .24054

lcrmrte     Coef.        Std. Err.    t        P>|t|    [95% Conf. Interval]
lprbarr     -.4522907    .0816261     -5.54    0.000    -.6151303    -.2894511
lprbconv    -.3003044    .0600259     -5.00    0.000    -.4200527    -.180556
lprbpris    -.0340435    .1251096     -0.27    0.786    -.2836303    .2155433
lavgsen     -.2134467    .1167513     -1.83    0.072    -.4463592    .0194659
lpolpc      .3610463     .0909534     3.97     0.000    .1795993     .5424934
ldensity    .3149706     .0698265     4.51     0.000    .1756705     .4542707
lwcon       .2727634     .2198714     1.24     0.219    -.165868     .7113949
lwtuc       .1603777     .1666014     0.96     0.339    -.171983     .4927385
lwtrd       .1325719     .3005086     0.44     0.660    -.4669263    .7320702
lwfir       -.3205858    .251185      -1.28    0.206    -.8216861    .1805146
lwser       -.2694193    .1039842     -2.59    0.012    -.4768622    -.0619765
lwmfg       .1029571     .1524804     0.68     0.502    -.2012331    .4071472
lwfed       .3856593     .3215442     1.20     0.234    -.2558039    1.027123
lwsta       -.078239     .2701264     -0.29    0.773    -.6171264    .4606485
lwloc       -.1774064    .4251793     -0.42    0.678    -1.025616    .670803
lpctymle    .0326912     .1580377     0.21     0.837    -.2825855    .3479678
lpctmin     .2245975     .0519005     4.33     0.000    .1210589     .3281361
west        -.087998     .1243235     -0.71    0.481    -.3360167    .1600207
central     -.1771378    .0739535     -2.40    0.019    -.3246709    -.0296046
urban       -.0896129    .1375084     -0.65    0.517    -.3639347    .184709
_cons       -3.395919    3.020674     -1.12    0.265    -9.421998    2.630159
Table 11.3  Instrumental Variables (2SLS) Regression: Crime in North Carolina

Source      SS            df    MS              Number of obs  =  90
Model       22.6350465    20    1.13175232      F(20,69)       =  17.35
Residual    4.16465515    69    .060357321      Prob > F       =  0.0000
Total       26.7997016    89    .301120243      R-squared      =  0.8446
                                                Adj R-squared  =  0.7996
                                                Root MSE       =  .24568

lcrmrte     Coef.        Std. Err.    t        P>|t|    [95% Conf. Interval]
lprbarr     -.4393081    .2267579     -1.94    0.057    -.8916777    .0130615
lpolpc      .5136133     .1976888     2.60     0.011    .1192349     .9079918
lprbconv    -.2713278    .0847024     -3.20    0.002    -.4403044    -.1023512
lprbpris    -.0278416    .1283276     -0.22    0.829    -.2838482    .2281651
lavgsen     -.280122     .1387228     -2.02    0.047    -.5568663    -.0033776
ldensity    .3273521     .0893292     3.66     0.000    .1491452     .505559
lwcon       .3456183     .2419206     1.43     0.158    -.137        .8282366
lwtuc       .1773533     .1718849     1.03     0.306    -.1655477    .5202542
lwtrd       .212578      .3239984     0.66     0.514    -.433781     .8589371
lwfir       -.3540903    .2612516     -1.36    0.180    -.8752731    .1670925
lwser       -.2911556    .1122454     -2.59    0.012    -.5150789    -.0672322
lwmfg       .0642196     .1644108     0.39     0.697    -.263771     .3922102
lwfed       .2974661     .3425026     0.87     0.388    -.3858079    .9807402
lwsta       .0037846     .3102383     0.01     0.990    -.615124     .6226931
lwloc       -.4336541    .5166733     -0.84    0.404    -1.464389    .597081
lpctymle    .0095115     .1869867     0.05     0.960    -.3635166    .3825397
lpctmin     .2285766     .0543079     4.21     0.000    .1202354     .3369179
west        -.0952899    .1301449     -0.73    0.467    -.3549219    .1643422
central     -.1792662    .0762815     -2.35    0.022    -.3314437    -.0270888
urban       -.1139416    .143354      -0.79    0.429    -.3999251    .1720419
_cons       -1.159015    3.898202     -0.30    0.767    -8.935716    6.617686

Instrumented:  lprbarr lpolpc
Instruments:   lprbconv lprbpris lavgsen ldensity lwcon lwtuc lwtrd lwfir lwser lwmfg lwfed
               lwsta lwloc lpctymle lpctmin west central ltaxpc lmix


proportion of the county's population that is minority or non-white. Percent young male is
the proportion of the county's population that is male and between the ages of 15 and 24.
Regional dummies are included for western and central counties. Opportunities in the legal
sector are captured by the average weekly wage in the county by industry. These industries are:
construction; transportation, utilities and communication; wholesale and retail trade; finance,
insurance and real estate; services; manufacturing; and federal, state and local government.

Results show that the probability of arrest as well as conviction given arrest have a negative
and significant effect on the crime rate with estimated elasticities of -0.45 and -0.30,
respectively. The probability of imprisonment given conviction as well as the sentence severity
have a negative but insignificant effect on the crime rate. The greater the number of police per
capita, the greater the number of reported crimes per capita. The estimated elasticity is 0.36
and it is significant. This could be explained by the fact that the larger the police force, the
larger the reported crime. Alternatively, this could be an endogeneity problem with more crime resulting
Table 11.4 Hausman's Test: Crime in North Carolina

                  ---- Coefficients ----
             |       (b)          (B)          (b-B)     sqrt(diag(V_b-V_B))
             |      b2sls         bols       Difference         S.E.
-------------+----------------------------------------------------------------
     lprbarr |    -.4393081    -.4522907     .0129826     .2115569
      lpolpc |     .5136133     .3610463      .152567     .1755231
    lprbconv |    -.2713278    -.3003044     .0289765     .0597611
    lprbpris |    -.0278416    -.0340435     .0062019     .0285582
     lavgsen |     -.280122    -.2134467    -.0666753     .0749208
    ldensity |     .3273521     .3149706     .0123815     .0557132
       lwcon |     .3456183     .2727634     .0728548     .1009065
       lwtuc |     .1773533     .1603777     .0169755     .0422893
       lwtrd |      .212578     .1325719     .0800061     .1211178
       lwfir |    -.3540903    -.3205858    -.0335045     .0718228
       lwser |    -.2911556    -.2694193    -.0217362     .0422646
       lwmfg |     .0642196     .1029571    -.0387375     .0614869
       lwfed |     .2974661     .3856593    -.0881932     .1179718
       lwsta |     .0037846     -.078239     .0820236     .1525764
       lwloc |    -.4336541    -.1774064    -.2562477      .293554
    lpctymle |     .0095115     .0326912    -.0231796     .0999404
     lpctmin |     .2285766     .2245975     .0039792     .0159902
        west |    -.0952899     -.087998    -.0072919     .0384885
     central |    -.1792662    -.1771378    -.0021284     .0187016
       urban |    -.1139416    -.0896129    -.0243287     .0405192
-------------+----------------------------------------------------------------
b = consistent under Ho and Ha; obtained from ivreg
B = inconsistent under Ha, efficient under Ho; obtained from regress

Test: Ho: difference in coefficients not systematic

      chi2(20) = (b-B)'[(V_b-V_B)^(-1)](b-B)
               = 0.87
   Prob > chi2 = 1.0000


in the hiring of more police. The higher the density of the population, the higher the crime rate.
The estimated elasticity is 0.31 and it is significant. Returns to legal activity are insignificant
except for wages in the service sector, which have a negative and significant effect on crime with
an estimated elasticity of -0.27. Percent young male is insignificant, while percent minority
is positive and significant with an estimated elasticity of 0.22. The central dummy variable is
negative and significant while the western dummy variable is not significant. Also, the urban
dummy variable is insignificant. Cornwell and Trumbull (1994) worried about the endogeneity
of police per capita and the probability of arrest. They used two additional variables as
instruments. The first is offense mix, which is the ratio of crimes involving face-to-face contact
(such as robbery, assault and rape) to those that do not. The rationale for using this variable is
that arrest is facilitated by positive identification of the offender. The second instrument is per
capita tax revenue. This is justified on the basis that counties with preferences for law
enforcement will vote for higher taxes to fund a larger police force.
The 2SLS estimates are reported in Table 11.3. The probability of arrest has an estimated
elasticity of -0.44 but now with a p-value of 0.057. The probability of conviction given arrest
has an estimated elasticity of -0.27 and is still significant. The probability of imprisonment
given conviction is still insignificant while the sentence severity is now negative and significant
with an estimated elasticity of -0.28. Police per capita has a higher elasticity of 0.51 and is
still significant. The remaining estimates are only slightly affected. In fact, the Hausman test
based on the difference between the OLS and 2SLS estimates is shown in Table 11.4. This is
computed using Stata and it contrasts 20 slope coefficient estimates. The Hausman test statistic
is 0.87 and is asymptotically distributed as chi-squared with 20 degrees of freedom. This is
insignificant, and shows that the 2SLS and OLS estimates are not significantly different given
this model specification and the specific choice of instruments. Note that this is a
just-identified equation and one cannot test for over-identification.
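
As a quick arithmetic check on the Hausman statistic quoted above, the following sketch evaluates the upper-tail probability of a chi-squared variate with 20 degrees of freedom at 0.87. It relies on the closed-form survival function that holds for even degrees of freedom, so only the Python standard library is needed. The function name chi2_sf_even is an illustrative choice and is not part of the Stata output.

```python
import math

def chi2_sf_even(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variate with even df.

    For df = 2m the survival function has the closed form
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0          # k = 0 term of the sum
    for k in range(1, df // 2):     # add terms k = 1, ..., m-1
        term *= half / k
        total += term
    return math.exp(-half) * total

hausman_stat = 0.87                 # statistic reported in Table 11.4
p_value = chi2_sf_even(hausman_stat, 20)
print(f"Prob > chi2 = {p_value:.4f}")
```

The probability is indistinguishable from one at four decimals, in line with the Prob > chi2 entry of Table 11.4: a statistic this small relative to its 20 degrees of freedom gives no evidence against the null.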

Note that the 2SLS estimates and the Hausman test are not robust to heteroskedasticity.
Table 11.5 gives the 2SLS estimates obtained by running ivregress in Stata with the robust
variance-covariance matrix option.

Note that the estimates remain the same as in Table 11.3 but the standard errors for the right
hand side endogenous variables Y1 are now larger. The postestimation command (estat endogenous)
generates the Hausman test which tests whether 2SLS is different from OLS based now on the
robust variance-covariance estimate. The observed F(2,67) statistic is 0.455 with a p-value of
0.636, which is not significant. This F-statistic could have been generated by an artificial
regression as follows: Obtain the residuals from the first stage regressions, see Tables 11.6 and
11.7 below. Call these residuals v1hat and v2hat. Include them as additional variables in the
original equation and run robust least squares. The robust Hausman test is equivalent to testing
that the coefficients of these two residuals are jointly zero. The Stata commands (without showing
the output of this artificial regression) and the resulting test statistic are shown below:

. quietly regress lcrmrte lprbarr lprbconv lprbpris lavgsen lpolpc ldensity
lwcon lwtuc lwtrd lwfir lwser lwmfg lwfed lwsta lwloc lpctymle lpctmin west
central urban v1hat v2hat if year==87, vce(robust)

. test v1hat v2hat

 ( 1)  v1hat = 0
 ( 2)  v2hat = 0

       F(  2,    67) =    0.46
            Prob > F =    0.6361

This is the same statistic obtained above with estat endogenous.
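
The artificial-regression idea can be illustrated on simulated data. The following is a minimal sketch and does not use the crime data: it assumes one endogenous regressor x, one instrument z, and made-up parameter values, and the helpers solve and ols are hand-written for illustration only. It shows that adding the first-stage residual to the structural equation reproduces the simple IV estimate of the coefficient on x. The coefficient on the residual is the quantity whose significance the Hausman-type test examines.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def ols(X, y):
    """OLS coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Simulated data (all numbers are assumptions chosen for illustration).
random.seed(12345)
n = 500
z = [random.gauss(0, 1) for _ in range(n)]
v = [random.gauss(0, 1) for _ in range(n)]
u = [0.8 * vi + random.gauss(0, 1) for vi in v]    # u correlated with v
x = [1.0 + 0.5 * zi + vi for zi, vi in zip(z, v)]  # x is endogenous
y = [2.0 - 0.4 * xi + ui for xi, ui in zip(x, u)]

# First stage: regress x on a constant and z, keep the residuals.
pi_hat = ols([[1.0, zi] for zi in z], x)
vhat = [xi - (pi_hat[0] + pi_hat[1] * zi) for xi, zi in zip(x, z)]

# Artificial regression: y on a constant, x and the first-stage residual.
b_aug = ols([[1.0, xi, vi] for xi, vi in zip(x, vhat)], y)

# Simple IV estimate of the slope for comparison.
zbar, xbar, ybar = sum(z) / n, sum(x) / n, sum(y) / n
b_iv = (sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y))
        / sum((zi - zbar) * (xi - xbar) for zi, xi in zip(z, x)))

print("slope on x in augmented regression:", b_aug[1])
print("simple IV slope:                   ", b_iv)
print("coefficient on first-stage residual:", b_aug[2])
```

The two slope estimates agree up to rounding error. In the crime example the same construction uses two residuals, one per endogenous regressor, and the test is the joint F-test on both.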

Tables 11.6 and 11.7 give the first-stage regressions for the probability of arrest and police
per capita. The R-squared values of these regressions are 0.47 and 0.56, respectively. The
F-statistics for the significance of all slope coefficients are 3.11 and 4.42, respectively. The
additional instruments (offense mix and per capita tax revenue) are jointly significant in both
regressions, yielding F-statistics of 5.78 and 10.56 with p-values of 0.0048 and 0.0001,
respectively. Although there are two right hand side endogenous regressors in the crime equation
rather than one, the Stock and Watson 'rule of thumb' suggests that these instruments may be weak.
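
The overall F-statistics quoted above follow directly from the first-stage R-squared values, with 90 observations and 20 slope coefficients. The short sketch below redoes that arithmetic; the function name overall_f is an illustrative choice.

```python
def overall_f(r2, n_obs, n_slopes):
    """F-statistic for the joint significance of all slope coefficients,
    computed from R-squared: (R2/k) / ((1 - R2)/(n - k - 1))."""
    df_resid = n_obs - n_slopes - 1
    return (r2 / n_slopes) / ((1.0 - r2) / df_resid)

f_lprbarr = overall_f(0.4742, 90, 20)   # first stage for probability of arrest
f_lpolpc = overall_f(0.5614, 90, 20)    # first stage for police per capita
print(f"lprbarr: F(20,69) = {f_lprbarr:.2f}")
print(f"lpolpc:  F(20,69) = {f_lpolpc:.2f}")
```

These reproduce the reported values of 3.11 and 4.42 up to rounding of the R-squared inputs.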
Table 11.5 Robust Variance Covariance (2SLS) Regression: Crime in North Carolina

Instrumental variables (2SLS) regression               Number of obs =      90
                                                       Wald chi2(20) = 1094.07
                                                       Prob > chi2   =  0.0000
                                                       R-squared     =  0.8446
                                                       Root MSE      =  .21511

-------------+--------------------------------------------------------------
             |               Robust
     lcrmrte |      Coef.  Std. Err.       z   P>|z|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
     lprbarr |  -.4393081    .311466   -1.41   0.158    -1.04977    .1711541
      lpolpc |   .5136133   .2483426    2.07   0.039    .0268707    1.000356
    lprbconv |  -.2713278   .1138502   -2.38   0.017   -.4944701   -.0481855
    lprbpris |  -.0278416   .1339361   -0.21   0.835   -.2903516    .2346685
     lavgsen |   -.280122   .1204801   -2.33   0.020   -.5162587   -.0439852
    ldensity |   .3273521   .0983388    3.33   0.001    .1346116    .5200926
       lwcon |   .3456183   .1961291    1.76   0.078   -.0387877    .7300243
       lwtuc |   .1773533   .1942597    0.91   0.361   -.2033887    .5580952
       lwtrd |    .212578   .2297782    0.93   0.355   -.2377789    .6629349
       lwfir |  -.3540903   .2299624   -1.54   0.124   -.8048082    .0966276
       lwser |  -.2911556   .0865243   -3.37   0.001   -.4607401    -.121571
       lwmfg |   .0642196   .1459929    0.44   0.660   -.2219213    .3503605
       lwfed |   .2974661   .3089013    0.96   0.336   -.3079692    .9029015
       lwsta |   .0037846   .2861629    0.01   0.989   -.5570843    .5646535
       lwloc |  -.4336541   .4840087   -0.90   0.370   -1.382294    .5149856
    lpctymle |   .0095115   .2232672    0.04   0.966   -.4280842    .4471073
     lpctmin |   .2285766   .0531983    4.30   0.000    .1243099    .3328434
        west |  -.0952899   .1293715   -0.74   0.461   -.3488534    .1582736
     central |  -.1792662   .0651109   -2.75   0.006   -.3068813   -.0516512
       urban |  -.1139416   .1065919   -1.07   0.285   -.3228579    .0949747
       _cons |  -1.159015   3.791608   -0.31   0.760    -8.59043      6.2724
-------------+--------------------------------------------------------------
Instrumented: lprbarr lpolpc
Instruments:  lprbconv lprbpris lavgsen ldensity lwcon lwtuc lwtrd lwfir lwser lwmfg lwfed lwsta
              lwloc lpctymle lpctmin west central urban ltaxpc lmix


One could obtain a lot of diagnostics on weak instruments after (ivregress 2sls) in Stata by
issuing the command (estat firststage). This is done in Table 11.8. The option forcenonrobust
forces these diagnostics to be computed after a robust regression, where the econometric theory
behind their derivation need not apply.

This is a just-identified equation, so we cannot test over-identification. We have already seen
the R-squared of the first stage regressions (0.47 and 0.56). These are not low enough to flag
possible weak instruments. But these R-squared measures may be due mostly to the inclusion
of the right hand side exogenous variables X1. We want to know the additional contribution
of the instruments X2 = (ltaxpc lmix) over and above X1. The partial R-squared provide such
measures and yield lower numbers. For lprbarr, this is 0.14. This is the squared partial
correlation between lprbarr and the instruments X2 = (ltaxpc lmix) after partialling out the right
hand side exogenous variables X1. The F(2,69) statistic tests the joint significance of the
excluded instruments X2 in the first stage regressions. These F-statistics of 6.58 and 6.68 are
certainly less than 10.
Table 11.6 First Stage Regression: Probability of Arrest

      Source |       SS        df        MS            Number of obs =      90
-------------+--------------------------------         F(20,69)      =    3.11
       Model |  6.84874028     20  0.342437014         Prob > F      =  0.0002
    Residual |  7.59345096     69  0.110050014         R-squared     =  0.4742
-------------+--------------------------------         Adj R-squared =  0.3218
       Total |  14.4421912     89  0.162271812         Root MSE      = 0.33174

-------------+--------------------------------------------------------------
     lprbarr |      Coef.  Std. Err.       t   P>|t|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
    lprbconv |  -.1946392   .0877581   -2.22   0.030   -.3697119   -.0195665
    lprbpris |  -.0240173   .1732583   -0.14   0.890   -.3696581    .3216236
     lavgsen |   .1565061   .1527134    1.02   0.309   -.1481488    .4611611
    ldensity |  -.2211654   .0941026   -2.35   0.022    -.408895   -.0334357
       lwcon |  -.2024569   .3020226   -0.67   0.505   -.8049755    .4000616
       lwtuc |  -.0461931    .230479   -0.20   0.842   -.5059861    .4135999
       lwtrd |   .0494793   .4105612    0.12   0.904    -.769568    .8685266
       lwfir |    .050559   .3507405    0.14   0.886   -.6491492    .7502671
       lwser |   .0551851   .1500094    0.37   0.714   -.2440754    .3544456
       lwmfg |   .0550689   .2138375    0.26   0.798   -.3715252     .481663
       lwfed |   .2622408   .4454479    0.59   0.558   -.6264035    1.150885
       lwsta |  -.4843599   .3749414   -1.29   0.201   -1.232347    .2636277
       lwloc |   .7739819   .5511607    1.40   0.165   -.3255536    1.873517
    lpctymle |  -.3373594   .2203286   -1.53   0.130    -.776903    .1021842
     lpctmin |  -.0096724   .0729716   -0.13   0.895   -.1552467    .1359019
        west |   .0701236   .1756211    0.40   0.691    -.280231    .4204782
     central |   .0112086   .1034557    0.11   0.914   -.1951798     .217597
       urban |  -.0150372   .2026425   -0.07   0.941   -.4192979    .3892234
      ltaxpc |  -.1938134   .1755345   -1.10   0.273   -.5439952    .1563684
        lmix |   .2682143   .0864373    3.10   0.003    .0957766    .4406519
       _cons |  -4.319234   3.797113   -1.14   0.259   -11.89427    3.255799
-------------+--------------------------------------------------------------

This is the rule of thumb suggested by Stock and Watson for the case of one right hand side
endogenous variable. Note that we computed these statistics above but without the robust
variance-covariance matrix option. Shea's partial R-squared of 0.135 for lprbarr is the R-squared
from running a regression of residuals on residuals. The first set of residuals comes from
regressing lprbarr on the right hand side included exogenous variables X1. The second set of
residuals comes from regressing the right hand side included exogenous variables X1 on the set of
instruments X2 = (ltaxpc lmix).
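
The partial R-squared measures and the conventional (non-robust) F-statistics for the excluded instruments are linked by simple arithmetic. The sketch below, which assumes the usual homoskedastic formula with 2 excluded instruments and 69 residual degrees of freedom, recovers the non-robust F-statistics of 5.78 and 10.56 quoted earlier from the partial R-squared values 0.1435 and 0.2344 in Table 11.8. The function name is illustrative.

```python
def f_from_partial_r2(partial_r2, n_excluded, df_resid):
    """Non-robust F-statistic for the excluded instruments implied by the
    partial R-squared: (pR2/q) / ((1 - pR2)/df_resid)."""
    return (partial_r2 / n_excluded) / ((1.0 - partial_r2) / df_resid)

f_arrest = f_from_partial_r2(0.1435, 2, 69)   # lprbarr
f_police = f_from_partial_r2(0.2344, 2, 69)   # lpolpc
print(f"lprbarr: F(2,69) = {f_arrest:.2f}")
print(f"lpolpc:  F(2,69) = {f_police:.2f}")
```

The robust F-statistics of 6.58 and 6.68 in Table 11.8 cannot be obtained this way, since they are based on the heteroskedasticity-robust variance-covariance matrix rather than on sums of squares.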

Because we have more than one right hand side endogenous variable, Stock and Yogo (2005)
suggest using the minimum eigenvalue of a matrix analog of the F-statistic originally proposed
by Cragg and Donald (1993) to test for identification. A low minimum eigenvalue statistic
indicates weak instruments. If there is only one right hand side endogenous variable, this reverts
back to the F-statistic which is compared to 10 by the ad hoc rule of Stock and Watson. The
critical values for this minimum eigenvalue statistic depend on how much bias we are willing to
tolerate relative to OLS in case of weak instruments. This relative-bias test is available only
when the degree of over-identification is two or more. This is why Stata does not report it in
this just-identified example. However, Stata does report a second test which applies even for the
Table 11.7 First Stage Regression: Police per Capita

      Source |       SS        df        MS            Number of obs =      90
-------------+--------------------------------         F(20,69)      =    4.42
       Model |  6.99830344     20  0.349915172         Prob > F      =  0.0000
    Residual |  5.46683312     69  0.079229465         R-squared     =  0.5614
-------------+--------------------------------         Adj R-squared =  0.4343
       Total |  12.4651366     89  0.140057714         Root MSE      = 0.28148

-------------+--------------------------------------------------------------
      lpolpc |      Coef.  Std. Err.       t   P>|t|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
    lprbconv |   .0037744   .0744622    0.05   0.960   -.1447736    .1523223
    lprbpris |  -.0487064   .1470085   -0.33   0.741   -.3419802    .2445675
     lavgsen |   .3958972   .1295763    3.06   0.003    .1373996    .6543948
    ldensity |   .0201292   .0798454    0.25   0.802   -.1391581    .1794165
       lwcon |  -.5368469   .2562641   -2.09   0.040    -1.04808    -.025614
       lwtuc |  -.0216638   .1955598   -0.11   0.912    -.411795    .3684674
       lwtrd |  -.4207274   .3483584   -1.21   0.231   -1.115683    .2742286
       lwfir |   .0001257   .2976009    0.00   1.000   -.5935718    .5938232
       lwser |   .0973089   .1272819    0.76   0.447   -.1566116    .3512293
       lwmfg |   .1710295   .1814396    0.94   0.349   -.1909327    .5329916
       lwfed |   .8555422   .3779595    2.26   0.027    .1015338    1.609551
       lwsta |  -.1118764   .3181352   -0.35   0.726   -.7465387    .5227859
       lwloc |   1.375102   .4676561    2.94   0.004    .4421535     2.30805
    lpctymle |   .4186939   .1869473    2.24   0.028    .0457442    .7916436
     lpctmin |  -.0517966   .0619159   -0.84   0.406   -.1753154    .0717222
        west |   .1458865   .1490133    0.98   0.331    -.151387    .4431599
     central |   .0477227   .0877814    0.54   0.588   -.1273964    .2228419
       urban |  -.1192027   .1719407   -0.69   0.490   -.4622151    .2238097
      ltaxpc |   .5601989   .1489398    3.76   0.000    .2630721    .8573258
        lmix |   .2177256   .0733414    2.97   0.004    .0714135    .3640378
       _cons |  -16.33148   3.221824   -5.07   0.000   -22.75884   -9.904113
-------------+--------------------------------------------------------------

just-identified case. This is based on size distortions of the Wald test for the joint significance of
the right hand side endogenous variables Y1 at the 5% level. The observed minimum eigenvalue
statistic of 5.31 is between the critical values of 7.03 and 4.58 and indicates that the 2SLS Wald
test has an actual size of more than 10% and less than 15% when it should be 5%.
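
The comparison with the critical values can be written out explicitly. The sketch below is only an illustration of how the bracket is read off Table 11.8; the function name size_bracket and the dictionary layout are assumptions made for this example, while the critical values are the ones reported in the table for the 2SLS size of a nominal 5% Wald test.

```python
def size_bracket(stat, critical_values):
    """Bracket the actual size of the nominal 5% Wald test.

    critical_values maps a maximal size s to its critical value. The null
    that the size exceeds s is rejected when stat is above that value.
    Returns (largest size not rejected, smallest size rejected).
    """
    sizes = sorted(critical_values)
    rejected = [s for s in sizes if stat > critical_values[s]]
    not_rejected = [s for s in sizes if stat <= critical_values[s]]
    lower = max(not_rejected) if not_rejected else None
    upper = min(rejected) if rejected else None
    return lower, upper

# Critical values from Table 11.8 (2 endogenous regressors, 2 excluded instruments).
crit_2sls_size = {0.10: 7.03, 0.15: 4.58, 0.20: 3.95, 0.25: 3.63}
lower, upper = size_bracket(5.31166, crit_2sls_size)
print(f"actual size is above {lower:.0%} but not above {upper:.0%}")
```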

Example 2: Growth and Inequality Reconsidered. Lundberg and Squire (2003) estimate a
two-equation model of growth and inequality using 3SLS; see section 10.5 for the SUR specification
where all the explanatory variables were assumed to be exogenous. The first equation relates
growth (dly) to education (adult years of schooling: yrt), the share of government consumption in
GDP (gov), M2/GDP (m2y), inflation (inf), the Sachs-Warner measure of openness (swo), changes
in the terms of trade (dtot), initial income (f_pcy), a dummy for the 1980s (d80) and a dummy for
the 1990s (d90). The second equation relates the Gini coefficient (gih) to education, M2/GDP,
the civil liberties index (civ), mean land Gini (mlg), and mean land Gini interacted with a dummy
for developing countries (mlgldc). The data contain 119 observations for 38 countries over the
period 1965-1990, and can be obtained from

http://www.res.org.uk/economic/datasets/datasetlist.asp.
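Table 11.8 Weak IV Diagnostics: The Crime Example

. estat firststage, forcenonrobust all

First-stage regression summary statistics
--------------------------------------------------------------------------
             |            Adjusted      Partial       Robust
    Variable |   R-sq.       R-sq.        R-sq.      F(2,69)    Prob > F
-------------+------------------------------------------------------------
     lprbarr |  0.4742      0.3218       0.1435      6.57801      0.0024
      lpolpc |  0.5614      0.4343       0.2344      6.68168      0.0022
--------------------------------------------------------------------------

Shea's partial R-squared
--------------------------------------------------
             |     Shea's             Shea's
    Variable |  Partial R-sq.   Adj. Partial R-sq.
-------------+------------------------------------
     lprbarr |     0.1352            -0.0996
      lpolpc |     0.2208             0.0093
--------------------------------------------------

Minimum eigenvalue statistic = 5.31166

Critical Values                        # of endogenous regressors:    2
Ho: Instruments are weak               # of excluded instruments:     2
---------------------------------------------------------------------
                                   |     5%     10%     20%     30%
2SLS relative bias                 |         (not available)
-----------------------------------+---------------------------------
                                   |    10%     15%     20%     25%
2SLS Size of nominal 5% Wald test  |    7.03    4.58    3.95    3.63
LIML Size of nominal 5% Wald test  |    7.03    4.58    3.95    3.63
---------------------------------------------------------------------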

Education, government, M2/GDP, inflation, the Sachs-Warner measure of openness, the civil
liberties index, mean land Gini, and mean land Gini interacted with a dummy for developing
countries (ldc) are assumed to be endogenous. Instruments include initial values of all variables
(except land Gini and income), population, urban share, life expectancy, fertility, initial female
literacy and democracy, arable area, dummies for oil and non-oil commodity exporters, and legal
origin. Table 11.9 reports the 3SLS estimates of these two equations using the reg3 command in
Stata. The results replicate those reported in Table 1 of Lundberg and Squire (2003, p. 334).
Allowing for endogeneity, these results still show that openness enhances growth and education
reduces inequality.
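
The two headline findings can be checked against the reported coefficients and standard errors in Table 11.9. The sketch below recomputes the z-ratios and two-sided normal p-values for openness (swo) in the growth equation and education (yrt) in the inequality equation, using only the Python standard library; the function name z_and_p is an illustrative choice.

```python
import math

def z_and_p(coef, se):
    """z-ratio and two-sided p-value under the standard normal distribution."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return z, p

z_swo, p_swo = z_and_p(4.162776, 0.9499015)    # openness, growth equation
z_yrt, p_yrt = z_and_p(-1.244464, 0.4153602)   # education, inequality equation
print(f"swo: z = {z_swo:.2f}, p = {p_swo:.3f}")
print(f"yrt: z = {z_yrt:.2f}, p = {p_yrt:.3f}")
```

Both ratios agree with the z and P>|z| columns of Table 11.9, with a positive and significant openness effect on growth and a negative and significant education effect on the Gini coefficient.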


Notes 

1.  A  heteroskedasticity-robust  statistic  is  recommended  especially  if  y2  has  discrete  characteristics. 

2.  Why  10?  See  the  proof  in  Appendix  10.4  of  Stock  and  Watson  (2003). 

3. This test is also known as the Durbin-Wu-Hausman test, following the work of Durbin (1954), Wu
(1973) and Hausman (1978).
Table 11.9 Growth and Inequality: 3SLS Estimates

reg3 (Growth: dly = yrt gov m2y inf swo dtot f_pcy d80 d90) (Inequality: gih = yrt m2y civ mlg
mlgldc), exog(commod f_civ f_dem f_dtot f_flit f_gov f_inf f_m2y f_swo f_yrt pop urb lex lfr marea oil
legor_fr legor_ge legor_mx legor_sc legor_uk) endog(yrt gov m2y inf swo civ mlg mlgldc)

Three-stage least-squares regression
----------------------------------------------------------------------
Equation          Obs  Parms        RMSE    "R-sq"       chi2        P
----------------------------------------------------------------------
Growth            119      9     2.34138    0.3905      65.55   0.0000
Inequality        119      5    7.032975    0.4368      94.28   0.0000
----------------------------------------------------------------------

-------------+--------------------------------------------------------------
             |      Coef.  Std. Err.       z   P>|z|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
Growth       |
         yrt |  -.0280625   .1827206   -0.15   0.878   -.3861882    .3300632
         gov |  -.0533221   .0447711   -1.19   0.234   -.1410718    .0344276
         m2y |   .0085368   .0199759    0.43   0.669   -.0306152    .0476889
         inf |  -.0008174   .0025729   -0.32   0.751   -.0058602    .0042254
         swo |   4.162776   .9499015    4.38   0.000    2.301003    6.024548
        dtot |   26.03736   23.05123    1.13   0.259   -19.14221    71.21694
       f_pcy |   -1.38017   .5488437   -2.51   0.012   -2.455884   -.3044564
         d80 |  -1.560392    .545112   -2.86   0.004   -2.628792   -.4919922
         d90 |  -3.413661   .6539689   -5.22   0.000   -4.695417   -2.131906
       _cons |   13.00837   3.968276    3.28   0.001    5.230693    20.78605
-------------+--------------------------------------------------------------
Inequality   |
         yrt |  -1.244464   .4153602   -3.00   0.003   -2.058555   -.4303731
         m2y |   -.120124   .0581515   -2.07   0.039   -.2340989   -.0061492
         civ |   .2531189   .7277433    0.35   0.728   -1.173232     1.67947
         mlg |    .292672   .0873336    3.35   0.001    .1215012    .4638428
      mlgldc |  -.0547843   .0576727   -0.95   0.342   -.1678207    .0582522
       _cons |   33.13231   5.517136    6.01   0.000    22.31893     43.9457
-------------+--------------------------------------------------------------
Endogenous variables: dly gih yrt gov m2y inf swo civ mlg mlgldc
Exogenous variables:  dtot f_pcy d80 d90 commod f_civ f_dem f_dtot f_flit f_gov f_inf f_m2y f_swo
                      f_yrt pop urb lex lfr marea oil legor_fr legor_ge legor_mx legor_sc legor_uk


Problems 

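1. The Inconsistency of OLS. Show that the OLS estimator of $\delta$ in (11.14), which can be written as

$\hat{\delta}_{OLS} = \sum_{t=1}^{T} p_t q_t / \sum_{t=1}^{T} p_t^2$

is not consistent for $\delta$. Hint: Write $\hat{\delta}_{OLS} = \delta + \sum_{t=1}^{T} p_t (u_{2t} - \bar{u}_2) / \sum_{t=1}^{T} p_t^2$, and use (11.18) to
show that

$\text{plim}\ \hat{\delta}_{OLS} = \delta + (\sigma_{12} - \sigma_{22})(\delta - \beta)/[\sigma_{11} + \sigma_{22} - 2\sigma_{12}]$.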

2. When Is the IV Estimator Consistent? Consider equation (11.30) and let $X_3$ and $X_4$ be the only
two other exogenous variables in this system.

(a) Show that a two-stage estimator which regresses $y_2$ on $X_1$, $X_2$ and $X_3$ to get $y_2 = \hat{y}_2 + \hat{v}_2$,
and $y_3$ on $X_1$, $X_2$ and $X_4$ to get $y_3 = \hat{y}_3 + \hat{v}_3$, and then regresses $y_1$ on $\hat{y}_2$, $\hat{y}_3$ and $X_1$ and
$X_2$ does not necessarily yield consistent estimators. Hint: Show that the composite error is
$\epsilon_1 = (u_1 + \alpha_{12}\hat{v}_2 + \alpha_{13}\hat{v}_3)$ and $\sum_{t=1}^{T} \hat{\epsilon}_{1t}\hat{y}_{2t} \neq 0$, because $\sum_{t=1}^{T} \hat{y}_{2t}\hat{v}_{3t} \neq 0$. The latter follows
because $\sum_{t=1}^{T} X_{3t}\hat{v}_{3t} \neq 0$. (This shows that if both y's are not regressed on the same
set of X's, the resulting two stage regression estimates are not consistent).

(b) Show that the two-stage estimator which regresses $y_2$ and $y_3$ on $X_2$, $X_3$ and $X_4$ to get
$y_2 = \hat{y}_2 + \hat{v}_2$ and $y_3 = \hat{y}_3 + \hat{v}_3$ and then regresses $y_1$ on $\hat{y}_2$, $\hat{y}_3$ and $X_1$ and $X_2$ is not
necessarily consistent. Hint: Show that the composite error term $\epsilon_1 = u_1 + \alpha_{12}\hat{v}_2 + \alpha_{13}\hat{v}_3$
does not satisfy $\sum_{t=1}^{T} \hat{\epsilon}_{1t}X_{1t} = 0$, since $\sum_{t=1}^{T} \hat{v}_{2t}X_{1t} \neq 0$ and $\sum_{t=1}^{T} \hat{v}_{3t}X_{1t} \neq 0$. (This shows
that if one of the included X's is not included in the first stage regression, then the resulting
two-stage regression estimates are not consistent).

3. Just-Identification and Simple IV. If equation (11.34) is just-identified, then $X_2$ is of the same
dimension as $Y_1$, i.e., both are $T \times g_1$. Hence, $Z_1$ is of the same dimension as $X$, both of dimension
$T \times (g_1 + k_1)$. Therefore, $X'Z_1$ is a square nonsingular matrix of dimension $(g_1 + k_1)$. Hence,
$(Z_1'X)^{-1}$ exists. Using this fact, show that $\hat{\delta}_{1,2SLS}$ given by (11.36) reduces to $(X'Z_1)^{-1}X'y_1$.
This is exactly the IV estimator with $W = X$, given in (11.41). Note that this is only feasible if
$X'Z_1$ is square and nonsingular.

4. 2SLS Can Be Obtained as GLS. Premultiply equation (11.34) by $X'$ and show that the transformed
disturbances $X'u_1 \sim (0, \sigma_{11}(X'X))$. Perform GLS on the resulting transformed equation and show
that $\hat{\delta}_{1,GLS}$ is $\hat{\delta}_{1,2SLS}$, given by (11.36).

5. The Equivalence of 3SLS and 2SLS.

(a) Show that $\hat{\delta}_{3SLS}$ given in (11.46) reduces to $\hat{\delta}_{2SLS}$, when (i) $\Sigma$ is diagonal, or (ii) every
equation in the system is just-identified. Hint: For (i); show that $\hat{\Sigma}^{-1} \otimes P_X$ is block-diagonal with
the i-th block consisting of $P_X/\hat{\sigma}_{ii}$. Also, $Z$ is block-diagonal, therefore, $\{Z'[\hat{\Sigma}^{-1} \otimes P_X]Z\}^{-1}$
is block-diagonal with the i-th block consisting of $\hat{\sigma}_{ii}(Z_i'P_XZ_i)^{-1}$. Similarly, computing
$Z'[\hat{\Sigma}^{-1} \otimes P_X]y$, one can show that the i-th element of $\hat{\delta}_{3SLS}$ is $(Z_i'P_XZ_i)^{-1}Z_i'P_Xy_i =
\hat{\delta}_{i,2SLS}$. For (ii); show that $Z_i'X$ is square and nonsingular under just-identification.
Therefore, $\hat{\delta}_{i,2SLS} = (X'Z_i)^{-1}X'y_i$ from problem 3. Also, from (11.44), we get

$\hat{\delta}_{3SLS} = \{\text{diag}[Z_i'X](\hat{\Sigma}^{-1} \otimes (X'X)^{-1})\text{diag}[X'Z_i]\}^{-1}$
$\{\text{diag}[Z_i'X](\hat{\Sigma}^{-1} \otimes (X'X)^{-1})(I_G \otimes X')y\}$.

Using the fact that $Z_i'X$ is square, one can show that $\hat{\delta}_{i,3SLS} = (X'Z_i)^{-1}X'y_i$.

(b) Premultiply the system of equations in (11.43) by $(I_G \otimes P_X)$ and let $y^* = (I_G \otimes P_X)y$,
$Z^* = (I_G \otimes P_X)Z$ and $u^* = (I_G \otimes P_X)u$, then $y^* = Z^*\delta + u^*$. Show that OLS on this
transformed model yields 2SLS on each equation in (11.43). Show that GLS on this model
yields 3SLS (knowing the true $\Sigma$) given in (11.45). Note that $\text{var}(u^*) = \Sigma \otimes P_X$ and its
generalized inverse is $\Sigma^{-1} \otimes P_X$. Use the Milliken and Albohali condition for the equivalence
of OLS and GLS given in equation (9.7) of Chapter 9 to deduce that 3SLS is equivalent
to 2SLS if $Z^{*\prime}(\Sigma^{-1} \otimes P_X)\bar{P}_{Z^*} = 0$. Show that this reduces to the following necessary and
sufficient condition $\sigma^{ij}\hat{Z}_i'\bar{P}_{\hat{Z}_j} = 0$ for $i \neq j$, see Baltagi (1989). Hint: Use the fact that

$Z^* = \text{diag}[P_XZ_i] = \text{diag}[\hat{Z}_i]$ and $\bar{P}_{Z^*} = \text{diag}[\bar{P}_{\hat{Z}_i}]$.

Verify that the two sufficient conditions given in part (a) satisfy this necessary and sufficient
condition.
6. Consider the following demand and supply equations:

$Q = a - bP + u_1$

$Q = c + dP + eW + fL + u_2$

where $W$ denotes weather conditions affecting supply, and $L$ denotes the supply of immigrant
workers available at harvest time.

(a) Write this system in the matrix form given by equation (A.1) in the Appendix.

(b) What does the order condition for identification say about these two equations?

(c) Premultiply this system by a nonsingular matrix $F = [f_{ij}]$, for $i, j = 1, 2$. What restrictions
must the matrix $F$ satisfy if the transformed model is to satisfy the same restrictions of the
original model? Show that the first row of $F$ is in fact the first row of an identity matrix, but
the second row of $F$ is not the second row of an identity matrix. What do you conclude?

7. Answer the same questions in problem 6 for the following model:

$Q = a - bP + cY + dA + u_1$

$Q = e + fP + gW + hL + u_2$

where $Y$ is real income and $A$ is real assets.

8. Consider example (A.1) in the Appendix. Recall that the system of equations (A.3) and (A.4) is
just-identified.

(a) Construct $\phi$ for the demand equation (A.3) and show that $A\phi = (0, -f)'$ which is of rank
1 as long as $f \neq 0$. Similarly, construct $\phi$ for the supply equation (A.4) and show that
$A\phi = (-c, 0)'$ which is of rank 1 as long as $c \neq 0$.

(b) Using equation (A.17), show how the structural parameters can be retrieved from the reduced
form parameters. Derive the reduced form equations for this system and verify the above
relationships relating the reduced form and structural form parameters.

9. Derive the reduced form equations for the model given in problem 6, and show that the structural
parameters of the second equation cannot be derived from the reduced form parameters. Also, show
that there is more than one way of expressing the structural parameters of the first equation in
terms of the reduced form parameters.

10. Just-Identified Model. Consider the just-identified equation

$$y_1 = Z_1\delta_1 + u_1$$

with $W$, the matrix of instruments for this equation, of dimension $T \times \ell$ where $\ell = g_1 + k_1$ is the dimension of $Z_1$. In this case, $W'Z_1$ is square and nonsingular.

(a) Show that the generalized instrumental variable estimator given below (11.41) reduces to the simple instrumental variable estimator given in (11.38).

(b) Show that the minimized value of the criterion function for this just-identified model is zero, i.e., show that $(y_1 - Z_1\hat{\delta}_{1,IV})'P_W(y_1 - Z_1\hat{\delta}_{1,IV}) = 0$.

(c) Conclude that the residual sum of squares of the second stage regression of this just-identified model is the same as that obtained by regressing $y_1$ on the matrix of instruments $W$, i.e., show that $(y_1 - \hat{Z}_1\hat{\delta}_{1,IV})'(y_1 - \hat{Z}_1\hat{\delta}_{1,IV}) = y_1'\bar{P}_Wy_1$ where $\hat{Z}_1 = P_WZ_1$ and $\bar{P}_W = I_T - P_W$. Hint: Show that $P_{\hat{Z}_1} = P_{P_WZ_1} = P_W$, under just-identification.
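
The three claims in problem 10 can be checked numerically before proving them. The following is a minimal sketch on simulated data; the sample size, the coefficient values and the way `Z1` is generated are illustrative assumptions, not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
T, l = 200, 3                        # l = number of instruments = columns of Z1
W = rng.normal(size=(T, l))          # hypothetical instrument matrix
Z1 = W @ rng.normal(size=(l, l)) + rng.normal(size=(T, l))  # regressors correlated with W
y1 = Z1 @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=T)

P_W = W @ np.linalg.solve(W.T @ W, W.T)   # projection on the column space of W

# Part (a): generalized IV versus simple IV
delta_giv = np.linalg.solve(Z1.T @ P_W @ Z1, Z1.T @ P_W @ y1)
delta_iv = np.linalg.solve(W.T @ Z1, W.T @ y1)

# Part (b): minimized value of the criterion function
resid = y1 - Z1 @ delta_iv
criterion = resid @ P_W @ resid

# Part (c): second-stage RSS versus RSS from regressing y1 on W
Z1_hat = P_W @ Z1
rss_second_stage = np.sum((y1 - Z1_hat @ delta_iv) ** 2)
rss_on_W = y1 @ (np.eye(T) - P_W) @ y1

print(np.allclose(delta_giv, delta_iv), criterion, rss_second_stage, rss_on_W)
```

The check works only because `W` and `Z1` have the same number of columns, so that `W.T @ Z1` is square; with more instruments than regressors the simple IV formula is not defined.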

11. The More Valid Instruments the Better. Let $W_1$ and $W_2$ be two sets of instrumental variables for the first structural equation given in (11.34). Suppose that $W_1$ is spanned by the space of $W_2$. Verify that the resulting IV estimator of $\delta_1$ based on $W_2$ is at least as efficient as that based on $W_1$. Hint: Show that $P_{W_2}W_1 = W_1$ and that $P_{W_2} - P_{W_1}$ is idempotent. Conclude that the difference in the corresponding asymptotic covariances of these IV estimators is positive semi-definite. (This shows that increasing the number of legitimate instruments should improve the asymptotic efficiency of an IV estimator.)
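
The two facts in the hint to problem 11 can be illustrated numerically. This is a sketch under assumed dimensions; `W1` is built as a linear combination of two columns of `W2` so that it lies in the space spanned by `W2`.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50
W2 = rng.normal(size=(T, 4))                 # larger instrument set (hypothetical)
W1 = W2[:, :2] @ rng.normal(size=(2, 2))     # columns lie in the span of W2

def proj(A):
    """Projection matrix A (A'A)^{-1} A'."""
    return A @ np.linalg.solve(A.T @ A, A.T)

P1, P2 = proj(W1), proj(W2)
D = P2 - P1                                  # difference of the two projections

print(np.allclose(P2 @ W1, W1))              # P_{W2} W1 = W1
print(np.allclose(D @ D, D))                 # P_{W2} - P_{W1} is idempotent
print(np.linalg.eigvalsh(D).min() > -1e-10)  # hence positive semi-definite
```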

12. Testing for Over-Identification. In testing $H_0$; $\gamma = 0$ versus $H_1$; $\gamma \neq 0$ in section 11.4, equation (11.47):

(a) Show that the second stage regression of 2SLS on the unrestricted model $y_1 = Z_1\delta_1 + W^*\gamma + u_1$ with the matrix of instruments $W$ yields the following residual sum of squares:

$$URSS^* = y_1'\bar{P}_Wy_1 = y_1'y_1 - y_1'P_Wy_1$$

Hint: Use the results of problem 10 for the just-identified case.

(b) Show that the second stage regression of 2SLS on the restricted model $y_1 = Z_1\delta_1 + u_1$ with the matrix of instruments $W$ yields the following residual sum of squares:

$$RRSS^* = y_1'\bar{P}_{\hat{Z}_1}y_1 = y_1'y_1 - y_1'P_{\hat{Z}_1}y_1$$

where $\hat{Z}_1 = P_WZ_1$ and $P_{\hat{Z}_1} = P_WZ_1(Z_1'P_WZ_1)^{-1}Z_1'P_W$. Conclude that $RRSS^* - URSS^*$ yields (11.49).

(c) Consider the test statistic $(RRSS^* - URSS^*)/\hat{\sigma}_{11}$ where $\hat{\sigma}_{11}$ is given by (11.50) as the usual 2SLS residual sum of squares under $H_0$ divided by $T$. Show that it can be written as Hausman's (1983) test statistic, i.e., $TR_u^2$ where $R_u^2$ is the uncentered $R^2$ of the regression of 2SLS residuals $(y_1 - Z_1\hat{\delta}_{1,2SLS})$ on the matrix of all pre-determined variables $W$. Hint: Show that the regression sum of squares $(y_1 - Z_1\hat{\delta}_{1,2SLS})'P_W(y_1 - Z_1\hat{\delta}_{1,2SLS}) = (RRSS^* - URSS^*)$ given in (11.49).

(d) Verify that the test for $H_0$ based on the GNR for the model given in part (a) yields the same $TR_u^2$ test statistic described in part (c).

13. Hausman's Specification Test: OLS versus 2SLS. This is based on Maddala (1992, page 511). For the simple regression

$$y_t = \beta x_t + u_t \qquad t = 1, 2, \ldots, T$$

where $\beta$ is scalar and $u_t \sim IIN(0, \sigma^2)$. Let $w_t$ be an instrumental variable for $x_t$. Run $x_t$ on $w_t$ and get $x_t = \hat{\pi}w_t + \hat{\nu}_t$ or $x_t = \hat{x}_t + \hat{\nu}_t$ where $\hat{x}_t = \hat{\pi}w_t$.

(a) Show that in the augmented regression $y_t = \beta x_t + \gamma\hat{x}_t + \epsilon_t$ a test for $\gamma = 0$ based on OLS from this regression yields Hausman's test-statistic. Hint: Show that $\hat{\gamma}_{OLS} = \hat{q}/(1 - r_{xw}^2)$ where

$$r_{xw}^2 = \left(\sum_{t=1}^T x_tw_t\right)^2 \Big/ \left(\sum_{t=1}^T w_t^2\sum_{t=1}^T x_t^2\right).$$

Next, show that $\text{var}(\hat{\gamma}_{OLS}) = \text{var}(\hat{\beta}_{OLS})/[r_{xw}^2(1 - r_{xw}^2)]$. Conclude that

$$\hat{\gamma}_{OLS}^2/\text{var}(\hat{\gamma}_{OLS}) = \hat{q}^2r_{xw}^2/[\text{var}(\hat{\beta}_{OLS})(1 - r_{xw}^2)]$$

is the Hausman (1978) test statistic $m$ given in section 11.5.

(b) Show that the same result in part (a) could have been obtained from the augmented regression

$$y_t = \beta x_t + \gamma\hat{\nu}_t + \eta_t$$

where $\hat{\nu}_t$ is the residual from the regression of $x_t$ on $w_t$.
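
The hint in problem 13(a) can be checked on simulated data. The sketch below assumes that $\hat{q}$ denotes the difference between the IV and OLS estimators of $\beta$, as in section 11.5; the data generating values are hypothetical. It also shows that the regression in part (b) gives the same coefficient with the opposite sign, so the squared t-ratio is unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 500
w = rng.normal(size=T)                    # instrument
v = rng.normal(size=T)
x = 0.8 * w + v                           # regressor
u = 0.5 * v + rng.normal(size=T)          # disturbance correlated with x
y = 1.5 * x + u

x_hat = (w @ x) / (w @ w) * w             # fitted values from x on w
nu_hat = x - x_hat                        # first-stage residuals

beta_ols = (x @ y) / (x @ x)
beta_iv = (w @ y) / (w @ x)
q_hat = beta_iv - beta_ols                # assumed definition of q-hat
r2_xw = (x @ w) ** 2 / ((w @ w) * (x @ x))

def ols(X, y):
    """OLS coefficients of y on the columns of X (no intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

gamma_a = ols(np.column_stack([x, x_hat]), y)[1]    # part (a)
gamma_b = ols(np.column_stack([x, nu_hat]), y)[1]   # part (b)

print(gamma_a, q_hat / (1 - r2_xw), gamma_b)
```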

14. Consider the following structural equation: $y_1 = \alpha_{12}y_2 + \alpha_{13}y_3 + \beta_{11}X_1 + \beta_{12}X_2 + u_1$ where $y_2$ and $y_3$ are endogenous and $X_1$ and $X_2$ are exogenous. Also, suppose that the excluded exogenous variables include $X_3$ and $X_4$.

(a) Show that Hausman's test statistic can be obtained from the augmented regression:

$$y_1 = \alpha_{12}y_2 + \alpha_{13}y_3 + \beta_{11}X_1 + \beta_{12}X_2 + \gamma_2\hat{y}_2 + \gamma_3\hat{y}_3 + \epsilon_1$$

where $\hat{y}_2$ and $\hat{y}_3$ are predicted values from regressing $y_2$ and $y_3$ on $X = [X_1, X_2, X_3, X_4]$. Hausman's test is equivalent to testing $H_0$; $\gamma_2 = \gamma_3 = 0$. See equation (11.54).

(b) Show that the same results in part (a) hold if we had used the following augmented regression:

$$y_1 = \alpha_{12}y_2 + \alpha_{13}y_3 + \beta_{11}X_1 + \beta_{12}X_2 + \gamma_2\hat{\nu}_2 + \gamma_3\hat{\nu}_3 + \eta_1$$

where $\hat{\nu}_2$ and $\hat{\nu}_3$ are the residuals from running $y_2$ and $y_3$ on $X = [X_1, X_2, X_3, X_4]$. See equation (11.55). Hint: Show that the regressions in (a) and (b) have the same residual sum of squares.

15. For the artificial regression given in (11.55):

(a) Show that OLS on this model yields $\hat{\delta}_{1,OLS} = \hat{\delta}_{1,IV} = (Z_1'P_WZ_1)^{-1}Z_1'P_Wy_1$. Hint: $Y_1 - \hat{Y}_1 = \bar{P}_WY_1$. Use the FWL Theorem to residual out these variables in (11.55) and use the fact that $\bar{P}_WZ_1 = [\bar{P}_WY_1, 0]$.

(b) Show that $\text{var}(\hat{\delta}_{1,OLS}) = s_{11}(Z_1'P_WZ_1)^{-1}$ where $s_{11}$ is the mean squared error of the OLS regression in (11.55). Note that when $\eta \neq 0$ in (11.55), IV estimation is necessary and $s_{11}$ underestimates $\sigma_{11}$ and will have to be replaced by $(y_1 - Z_1\hat{\delta}_{1,IV})'(y_1 - Z_1\hat{\delta}_{1,IV})/T$.

16. Recursive Systems. A recursive system has two crucial features: $B$ is a triangular matrix and $\Sigma$ is a diagonal matrix. For this special case of the simultaneous equations model, OLS is still consistent, and under normality of the disturbances still maximum likelihood. Let us consider a specific example:

$$y_{1t} + \gamma_{11}x_{1t} + \gamma_{12}x_{2t} = u_{1t}$$
$$\beta_{21}y_{1t} + y_{2t} + \gamma_{23}x_{3t} = u_{2t}$$

In this case, $B = \begin{bmatrix} 1 & 0 \\ \beta_{21} & 1 \end{bmatrix}$ is triangular and $\Sigma = \begin{bmatrix} \sigma_{11} & 0 \\ 0 & \sigma_{22} \end{bmatrix}$ is assumed diagonal.


(a)  Check  the  identifiability  conditions  of  this  recursive  system. 

(b) Solve for the reduced form and show that $y_{1t}$ is only a function of the $x_t$'s and $u_{1t}$, while $y_{2t}$ is a function of the $x_t$'s and a linear combination of $u_{1t}$ and $u_{2t}$.

(c) Show that OLS on the first structural equation yields consistent estimates. Hint: There are no right hand side $y$'s for the first equation. Show that despite the presence of $y_1$ in the second equation, OLS of $y_2$ on $y_1$ and $x_3$ yields consistent estimates. Note that $y_1$ is a function of $u_1$ only and $u_1$ and $u_2$ are not correlated.

(d) Under the normality assumption on the disturbances, the likelihood function conditional on the $x$'s is given by

$$L(B, \Gamma, \Sigma) = (2\pi)^{-T/2}|B|^T|\Sigma|^{-T/2}\exp\left(-\frac{1}{2}\sum_{t=1}^T u_t'\Sigma^{-1}u_t\right)$$

where in this two equation case $u_t' = (u_{1t}, u_{2t})$. Since $B$ is triangular, $|B| = 1$. Show that maximizing $L$ with respect to $B$ and $\Gamma$ is equivalent to minimizing $Q = \sum_{t=1}^T u_t'\Sigma^{-1}u_t$. Conclude that when $\Sigma$ is diagonal, $\Sigma^{-1}$ is diagonal and $Q = \sum_{t=1}^T u_{1t}^2/\sigma_{11} + \sum_{t=1}^T u_{2t}^2/\sigma_{22}$. Hence, maximizing the likelihood with respect to $B$ and $\Gamma$ is equivalent to running OLS on each equation separately.

17. Hausman's Specification Test: 2SLS Versus 3SLS. This is based on Holly (1988). Consider the two-equations model,

$$y_1 = \alpha y_2 + \beta_1x_1 + \beta_2x_2 + u_1$$
$$y_2 = \gamma y_1 + \beta_3x_3 + u_2$$

where $y_1$ and $y_2$ are endogenous; $x_1$, $x_2$ and $x_3$ are exogenous (the $y$'s and the $x$'s are $n \times 1$ vectors). The standard assumptions are made on the disturbance vectors $u_1$ and $u_2$. With the usual notation, the model can also be written as

$$y_1 = Z_1\delta_1 + u_1$$
$$y_2 = Z_2\delta_2 + u_2$$

The following notation will be used: $\hat{\delta}$ = 2SLS, $\tilde{\delta}$ = 3SLS, and the corresponding residuals will be denoted as $\hat{u}$ and $\tilde{u}$, respectively.

(a) Assume that $\alpha\gamma \neq 1$. Show that the 3SLS estimating equations reduce to

$$\sigma^{11}X'\tilde{u}_1 + \sigma^{12}X'\tilde{u}_2 = 0$$
$$\sigma^{12}Z_2'P_X\tilde{u}_1 + \sigma^{22}Z_2'P_X\tilde{u}_2 = 0$$

where $X = (x_1, x_2, x_3)$, $\Sigma = [\sigma_{ij}]$ is the structural form covariance matrix, and $\Sigma^{-1} = [\sigma^{ij}]$ for $i, j = 1, 2$.

(b) Deduce that $\tilde{\delta}_2 = \hat{\delta}_2$ and $\tilde{\delta}_1 = \hat{\delta}_1 - (\sigma_{12}/\sigma_{22})(Z_1'P_XZ_1)^{-1}Z_1'P_X\hat{u}_2$. This proves that the 3SLS estimator of the over-identified second equation is equal to its 2SLS counterpart. Also, the 3SLS estimator of the just-identified first equation differs from its 2SLS (or indirect least squares) counterpart by a linear combination of the 2SLS (or 3SLS) residuals of the over-identified equation, see Theil (1971).

(c) How would you interpret a Hausman-type test where you compare $\hat{\delta}_1$ and $\tilde{\delta}_1$? Show that it is $nR^2$ where $R^2$ is the R-squared of the regression of $\hat{u}_2$ on the set of second stage regressors of both equations $\hat{Z}_1$ and $\hat{Z}_2$. Hint: See the solution by Baltagi (1989).

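18. For the two-equation simultaneous model

$$y_{1t} = \beta_{12}y_{2t} + \gamma_{11}x_{1t} + u_{1t}$$
$$y_{2t} = \beta_{21}y_{1t} + \gamma_{22}x_{2t} + \gamma_{23}x_{3t} + u_{2t}$$

with

$$X'X = \begin{bmatrix} 20 & 0 & 0 \\ 0 & 20 & 0 \\ 0 & 0 & 10 \end{bmatrix} \qquad X'Y = \begin{bmatrix} 5 & 10 \\ 40 & 20 \\ 20 & 30 \end{bmatrix} \qquad Y'Y = \begin{bmatrix} 3 & 4 \\ 4 & 8 \end{bmatrix}$$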

(a)  Determine  the  identifiability  of  each  equation  with  the  aid  of  the  order  and  rank  conditions 
for  identification. 

(b)  Obtain  the  OLS  normal  equations  for  both  equations.  Solve  for  the  OLS  estimates. 

(c)  Obtain  the  2SLS  normal  equations  for  both  equations.  Solve  for  the  2SLS  estimates. 

(d)  Can  you  estimate  these  equations  using  Indirect  Least  Squares?  Explain. 

19.  Laffer  (1970)  considered  the  following  supply  and  demand  equations  for  Traded  Money: 

$$\log(TM/P) = \alpha_0 + \alpha_1\log(RM/P) + \alpha_2\log i + u_1$$

$$\log(TM/P) = \beta_0 + \beta_1\log(Y/P) + \beta_2\log i + \beta_3\log(S1) + \beta_4\log(S2) + u_2$$


where 


TM = Nominal total trade money
RM = Nominal effective reserve money
Y = GNP in current dollars
S2 = Degree of market utilization
i = Short-term rate of interest
S1 = Mean real size of the representative economic unit (1939 = 100)
P = GNP price deflator (1958 = 100)

The  basic  idea  is  that  trade  credit  is  a  line  of  credit  and  the  unused  portion  represents  purchasing 
power  which  can  be  used  as  a  medium  of  exchange  for  goods  and  services.  Hence,  Laffer  (1970) 
suggests  that  trade  credit  should  be  counted  as  part  of  the  money  supply.  Besides  real  income  and 
the short-term interest rate, the demand for real traded money includes log(S1) and log(S2). S1
is included to capture economies of scale. As S1 increases, holding everything else constant, the
presence of economies of scale would mean that the demand for traded money would decrease.
Also, the larger S2, the larger the degree of market utilization and the more money is needed for
transaction purposes.

The data are provided on the Springer web site as LAFFER.ASC. These data cover 21 annual
observations over the period 1946-1966 and were obtained from Lott and Ray (1992). Assume
that (TM/P) and i are endogenous and the rest of the variables in this model are exogenous.

(a) Using the order condition for identification, determine whether the demand and supply equations
are identified. What happens if you use the rank condition of identification?

(b)  Estimate  this  model  using  OLS. 

(c)  Estimate  this  model  using  2SLS. 

(d)  Estimate  this  model  using  3SLS.  Compare  the  estimates  and  their  standard  errors  for  parts 
(b),  (c)  and  (d). 

(e)  Test  the  over-identification  restriction  of  each  equation. 

(f)  Run  Hausman’s  specification  test  on  each  equation  basing  it  on  OLS  and  2SLS. 

(g)  Run  Hausman’s  specification  test  on  each  equation  basing  it  on  2SLS  and  3SLS. 

20. The market for a certain good is expressed by the following equations:

$$D_t = \alpha_0 - \alpha_1P_t + \alpha_2X_t + u_{1t} \qquad (\alpha_1, \alpha_2 > 0)$$
$$S_t = \beta_0 + \beta_1P_t + u_{2t} \qquad (\beta_1 > 0)$$
$$D_t = S_t = Q_t$$

where $D_t$ is the quantity demanded, $S_t$ is the quantity supplied, and $X_t$ is an exogenous demand shift variable. $(u_{1t}, u_{2t})$ is an IID random vector with zero mean and covariance matrix $\Sigma = [\sigma_{ij}]$ for $i, j = 1, 2$.

(a)  Examine  the  identifiability  of  the  model  under  the  assumptions  given  above  using  the  order 
and  rank  conditions  of  identification. 

(b) Assuming the moment matrix of exogenous variables converges to a finite non-zero matrix, derive the simultaneous equation bias in the OLS estimator of $\beta_1$.

(c) If $\sigma_{12} = 0$ would you expect this bias to be positive or negative? Explain.

21. Consider the following three equations simultaneous model

$$y_1 = \alpha_1 + \beta_2y_2 + \gamma_1X_1 + u_1 \tag{1}$$
$$y_2 = \alpha_2 + \beta_1y_1 + \beta_3y_3 + \gamma_2X_2 + u_2 \tag{2}$$
$$y_3 = \alpha_3 + \gamma_3X_3 + \gamma_4X_4 + \gamma_5X_5 + u_3 \tag{3}$$

where the $X$'s are exogenous and the $y$'s are endogenous.

(a) Examine the identifiability of this system using the order and rank conditions.

(b) How would you estimate equation (2) by 2SLS? Describe your procedure step by step.

(c) Suppose that equation (1) was estimated by running $y_2$ on a constant, $X_2$ and $X_3$ and the resulting predicted $\hat{y}_2$ was substituted in (1), and OLS performed on the resulting model. Would this estimating procedure yield consistent estimates of $\alpha_1$, $\beta_2$ and $\gamma_1$? Explain your answer.

(d)  How  would  you  test  for  the  over-identification  restrictions  in  equation  (1)? 

22. Equivariance of Instrumental Variables Estimators. This is based on Sapra (1997). For the structural equation given in (11.34), let the matrix of instruments $W$ be of dimension $T \times \ell$ where $\ell \geq g_1 + k_1$ as described below (11.41). Then the corresponding instrumental variable estimator of $\delta_1$ given below (11.41) is $\hat{\delta}_{1,IV}(y_1) = (Z_1'P_WZ_1)^{-1}Z_1'P_Wy_1$.

(a) Show that this IV estimator is an equivariant estimator of $\delta_1$, i.e., show that for any linear transformation $y_1^* = ay_1 + Z_1b$ where $a$ is a positive scalar and $b$ is an $(\ell \times 1)$ real vector, the following relationship holds:

$$\hat{\delta}_{1,IV}(y_1^*) = a\hat{\delta}_{1,IV}(y_1) + b.$$

(b) Show that the variance estimator

$$\hat{\sigma}^2(y_1) = (y_1 - Z_1\hat{\delta}_{1,IV}(y_1))'(y_1 - Z_1\hat{\delta}_{1,IV}(y_1))/T$$

is equivariant for $\sigma^2$, i.e., show that $\hat{\sigma}^2(y_1^*) = a^2\hat{\sigma}^2(y_1)$.

23. Identification and Estimation of a Simple Two-Equation Model. This is based on Holly (1987). Consider the following two equation model

$$y_{t1} = \alpha + \beta y_{t2} + u_{t1}$$
$$y_{t2} = \gamma + y_{t1} + u_{t2}$$

where the $y$'s are endogenous variables and the $u$'s are serially independent disturbances that are identically distributed with zero means and nonsingular covariance matrix $\Sigma = [\sigma_{ij}]$ where $E(u_{ti}u_{tj}) = \sigma_{ij}$ for $i, j = 1, 2$, and all $t = 1, 2, \ldots, T$. The reduced form equations are given by

$$y_{t1} = \pi_{11} + \nu_{t1} \quad \text{and} \quad y_{t2} = \pi_{21} + \nu_{t2}$$

with $\Omega = [\omega_{ij}]$ where $E(\nu_{ti}\nu_{tj}) = \omega_{ij}$ for $i, j = 1, 2$ and all $t = 1, 2, \ldots, T$.

(a) Examine the identification of this system of two equations when no further information is available.

(b) Repeat part (a) when $\sigma_{12} = 0$.

(c) Assuming $\sigma_{12} = 0$, show that $\hat{\beta}_{OLS}$, the OLS estimator of $\beta$ in the first equation, is not consistent.

(d) Assuming $\sigma_{12} = 0$, show that an alternative consistent estimator of $\beta$ is an IV estimator using $z_t = [(y_{t2} - \bar{y}_2) - (y_{t1} - \bar{y}_1)]$ as an instrument for $y_{t2}$.

(e) Show that the IV estimator of $\beta$ obtained from part (d) is also an indirect least squares estimator of $\beta$. Hint: See the solution by Singh and Bhat (1988).

24. Errors in Measurement and the Wald (1940) Estimator. This is based on Farebrother (1985). Let $y_i^*$ be permanent consumption and $x_i^*$ be permanent income, both are measured with error:

$$y_i^* = \beta x_i^* \quad \text{where } y_i = y_i^* + \epsilon_i \text{ and } x_i = x_i^* + u_i \text{ for } i = 1, 2, \ldots, n.$$

Let $x_i^*$, $\epsilon_i$ and $u_i$ be independent normal random variables with zero means and variances $\sigma_*^2$, $\sigma_\epsilon^2$ and $\sigma_u^2$, respectively. Wald (1940) suggested the following estimator of $\beta$: Order the sample by the $x_i$'s and split the sample into two. Let $(\bar{y}_1, \bar{x}_1)$ be the sample mean of the first half of the sample and $(\bar{y}_2, \bar{x}_2)$ be the sample mean of the second half of this sample. Wald's estimator of $\beta$ is $\hat{\beta}_W = (\bar{y}_2 - \bar{y}_1)/(\bar{x}_2 - \bar{x}_1)$. It is the slope of the line joining these two sample mean observations.

(a) Show that $\hat{\beta}_W$ can be interpreted as a simple IV estimator with instrument

$$z_i = \begin{cases} 1 & \text{for } x_i \geq \text{median}(x) \\ -1 & \text{for } x_i < \text{median}(x) \end{cases}$$

where median$(x)$ is the sample median of $x_1, x_2, \ldots, x_n$.

(b) Define $w_i = \rho^2x_i^* - \tau^2u_i$ where $\rho^2 = \sigma_u^2/(\sigma_u^2 + \sigma_*^2)$ and $\tau^2 = \sigma_*^2/(\sigma_u^2 + \sigma_*^2)$. Show that $E(x_iw_i) = 0$ and that $w_i \sim N(0, \sigma_*^2\sigma_u^2/(\sigma_*^2 + \sigma_u^2))$.

(c) Show that $x_i^* = \tau^2x_i + w_i$ and use it to show that

$$E(\hat{\beta}_W/x_1, \ldots, x_n) = E(\hat{\beta}_{OLS}/x_1, \ldots, x_n) = \beta\tau^2.$$

Conclude that the exact small sample bias of $\hat{\beta}_{OLS}$ and $\hat{\beta}_W$ are the same.
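
Part (a) of problem 24 can be illustrated on simulated data. The sketch below assumes an even sample size and hypothetical parameter values; it computes Wald's grouping estimator directly and as a simple IV estimator with the instrument equal to 1 at or above the sample median of $x$ and $-1$ below it.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 1000, 0.9                            # hypothetical sample size and slope
x_star = rng.normal(scale=2.0, size=n)         # permanent income
x = x_star + rng.normal(size=n)                # measured income
y = beta * x_star + rng.normal(size=n)         # measured consumption

# Wald's estimator: slope of the line joining the two half-sample means
order = np.argsort(x)
lower, upper = order[: n // 2], order[n // 2:]
beta_wald = (y[upper].mean() - y[lower].mean()) / (x[upper].mean() - x[lower].mean())

# Simple IV estimator with the +1/-1 instrument
z = np.where(x >= np.median(x), 1.0, -1.0)
beta_iv = (z @ y) / (z @ x)

beta_ols = (x @ y) / (x @ x)
print(beta_wald, beta_iv, beta_ols)
```

With continuous data and an even $n$, exactly half of the observations fall on each side of the median, which is what makes the two computations coincide.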

25. Comparison of t-ratios. This is based on Holly (1990). Consider the two equations model

$$y_1 = \alpha y_2 + X\beta + u_1 \quad \text{and} \quad y_2 = \gamma y_1 + X\beta + u_2$$

where $\alpha$ and $\gamma$ are scalars, $y_1$ and $y_2$ are $T \times 1$ and $X$ is a $T \times (K - 1)$ matrix of exogenous variables. Assume that $u_i \sim N(0, \sigma_i^2I_T)$ for $i = 1, 2$. Show that the t-ratios for $H_0^a$; $\alpha = 0$ and $H_0^b$; $\gamma = 0$ using $\hat{\alpha}_{OLS}$ and $\hat{\gamma}_{OLS}$ are the same. Comment on this result. Hint: See the solution by Farebrother (1991).
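
The equality of the two t-ratios in problem 25 is easy to see numerically. The helper `t_ratio_first` and the simulated data below are illustrative assumptions; the function runs OLS and returns the conventional t-ratio of the first regressor.

```python
import numpy as np

def t_ratio_first(y, X):
    """OLS of y on X; return the t-ratio of the coefficient on the first column."""
    T, k = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    s2 = (e @ e) / (T - k)                       # usual unbiased variance estimate
    cov = s2 * np.linalg.inv(X.T @ X)
    return b[0] / np.sqrt(cov[0, 0])

rng = np.random.default_rng(4)
T = 100
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])   # hypothetical exogenous variables
y2 = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=T)
y1 = 0.4 * y2 + X @ np.array([0.2, -1.0, 0.7]) + rng.normal(size=T)

t_alpha = t_ratio_first(y1, np.column_stack([y2, X]))   # y1 on y2 and X
t_gamma = t_ratio_first(y2, np.column_stack([y1, X]))   # y2 on y1 and X
print(t_alpha, t_gamma)
```

Both t-ratios are the same function of the partial correlation between $y_1$ and $y_2$ given $X$, with the same degrees of freedom, which is the point the problem asks the reader to establish.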

26. Degeneration of Feasible GLS to 2SLS in a Limited Information Simultaneous Equations Model. This is based on Gao and Lahiri (2000). Consider a simple limited information simultaneous equations model,

$$y_1 = \gamma y_2 + u, \tag{1}$$
$$y_2 = X\beta + v, \tag{2}$$

where $y_1$, $y_2$ are $N \times 1$ vectors of observations on two endogenous variables. $X$ is an $N \times K$ matrix of predetermined variables of the system, and $K \geq 1$ such that (1) is identified. Each row of $(u, v)$ is assumed to be i.i.d. $(0, \Sigma)$, and $\Sigma$ is p.d. In this case, $\hat{\gamma}_{2SLS} = (y_2'P_Xy_2)^{-1}y_2'P_Xy_1$, where $P_X = X(X'X)^{-1}X'$. The residuals $\hat{u} = y_1 - \hat{\gamma}_{2SLS}y_2$ and $\hat{v} = My_2$, where $M = I_N - P_X$, are used to generate a consistent estimate for $\Sigma$

$$\hat{\Sigma} = \frac{1}{N}\begin{bmatrix} \hat{u}'\hat{u} & \hat{u}'\hat{v} \\ \hat{v}'\hat{u} & \hat{v}'\hat{v} \end{bmatrix}$$

Show that a feasible GLS estimate of $\gamma$ using $\hat{\Sigma}$ degenerates to $\hat{\gamma}_{2SLS}$.

27. Equality of Two IV Estimators. This is based on Qian (1999). Consider the following linear regression model:

$$y_i = x_i'\beta + \epsilon_i = x_{1i}'\beta_1 + x_{2i}'\beta_2 + \epsilon_i, \qquad i = 1, 2, \ldots, N, \tag{1}$$

where the dimensions of $x_i'$, $x_{1i}'$ and $x_{2i}'$ are $1 \times K$, $1 \times K_1$ and $1 \times K_2$, respectively, with $K = K_1 + K_2$. $x_i$ may be correlated with $\epsilon_i$, but we have instruments $z_i$ such that $E(\epsilon_i|z_i') = 0$ and $E(\epsilon_i^2|z_i') = \sigma^2$. Partition the instruments into two subsets: $z_i' = (z_{1i}', z_{2i}')$, where the dimensions of $z_i$, $z_{1i}$ and $z_{2i}$ are $L$, $L_1$ and $L_2$, with $L = L_1 + L_2$. Assume that $E(z_{1i}x_i')$ has full column rank (so $L_1 \geq K$); this ensures that $\beta$ can be estimated consistently using the subset of instruments $z_{1i}$ only or using the entire set $z_i' = (z_{1i}', z_{2i}')$. We also assume that $(y_i, x_i', z_i')$ is covariance stationary.

Define $P_A = A(A'A)^{-1}A'$ and $M_A = I - P_A$ for any matrix $A$ with full column rank. Let $X = (x_1, \ldots, x_N)'$, and similarly for $X_1$, $X_2$, $y$, $Z_1$, $Z_2$ and $Z$. Define $\hat{X} = P_{[Z_1]}X$ and $\hat{\beta} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$, so that $\hat{\beta}$ is the instrumental variables (IV) estimator of (1) using $Z_1$ as instruments. Similarly, define $\tilde{X} = P_ZX$ and $\tilde{\beta} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'y$, so that $\tilde{\beta}$ is the IV estimator of (1) using $Z$ as instruments.

Show that $\hat{\beta}_1 = \tilde{\beta}_1$ if $Z_2'M_{[Z_1]}[X_1 - X_2(X_2'P_{[Z_1]}X_2)^{-1}X_2'P_{[Z_1]}X_1] = 0$.

28.  For  the  crime  in  North  Carolina  example  given  in  section  11.6,  replicate  the  results  in  Tables 
11.2-11.6  using  the  data  for  1987.  Do  the  same  using  the  data  for  1981.  Are  there  any  notable 
differences  in  the  results  as  we  compare  1981  to  1987? 


29. Spatial Lag Test with Equal Weights. This is based on Baltagi and Liu (2009). Consider the spatial lag dependence model described in Section 11.2.1, with the $N \times N$ spatial weight matrix $W$ having zero elements across the diagonal and equal elements $1/(N - 1)$ off the diagonal. The LM test for zero spatial lag, i.e., $H_0: \rho = 0$ versus $H_1: \rho \neq 0$, is given in Anselin (1988). This takes the form

$$LM = \frac{[\hat{u}'Wy/(\hat{u}'\hat{u}/N)]^2}{(WX\hat{\beta})'\bar{P}_X(WX\hat{\beta})/\hat{\sigma}^2 + tr(W^2 + W'W)}$$

where $\bar{P}_X = I - X(X'X)^{-1}X'$, and $\hat{\beta}$ is the restricted mle, which in this case is the least squares estimator of $\beta$. Similarly, $\hat{\sigma}^2$ is the corresponding restricted mle of $\sigma^2$, which in this case is the least squares residual sum of squares divided by $N$, i.e., $(\hat{u}'\hat{u}/N)$, where $\hat{u}$ denotes the least squares residuals. Show that for the equal weight spatial matrix, this LM test statistic will always equal $N/[2(N - 1)]$ no matter what $\rho$ is.
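
The claim in problem 29 can be checked by simulation. The sketch below assumes that $X$ contains a constant (so that the least squares residuals sum to zero) and uses hypothetical values for $N$, $\rho$ and $\beta$; the data are generated from the spatial lag model $y = \rho Wy + X\beta + u$.

```python
import numpy as np

rng = np.random.default_rng(5)
N, rho = 30, 0.4                                   # hypothetical sample size and spatial lag
W = (np.ones((N, N)) - np.eye(N)) / (N - 1)        # equal-weight spatial matrix
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # assumes a constant in X
beta = np.array([1.0, 2.0, -1.0])

# Generate y from (I - rho W) y = X beta + u
y = np.linalg.solve(np.eye(N) - rho * W, X @ beta + rng.normal(size=N))

b = np.linalg.lstsq(X, y, rcond=None)[0]           # restricted mle of beta (OLS)
u = y - X @ b
sigma2 = (u @ u) / N                               # restricted mle of sigma^2
P_bar = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)
WXb = W @ X @ b

LM = (u @ W @ y / sigma2) ** 2 / (WXb @ P_bar @ WXb / sigma2 + np.trace(W @ W + W.T @ W))
print(LM, N / (2 * (N - 1)))
```

Changing `rho` or the seed leaves `LM` unchanged, which illustrates why the test has no power under equal weights.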


30.  Growth  and  Inequality  Reconsidered.  For  the  Lundberg  and  Squire  (2003)  Growth  and  Inequality 
example  considered  in  section  11.6. 


(a)  Estimate  these  equations  using  3SLS  and  verify  the  results  reported  in  Table  1  of  Lundberg 
and  Squire  (2003,  p.  334).  These  results  show  that  openness  enhances  growth  and  education 
reduces  inequality,  see  Table  11.9.  How  do  these  3SLS  results  compare  with  2SLS?  Are  the 
over-identification  restrictions  rejected  by  the  data? 

(b) Lundberg and Squire (2003) allow growth to enter the inequality equation, and inequality to
enter  the  growth  equation.  Estimate  these  respecified  equations  using  3SLS  and  verify  the 
results  reported  in  Table  2  of  Lundberg  and  Squire  (2003,  p.  336).  How  do  these  3SLS  results 
compare  with  2SLS?  Are  the  over-identification  restrictions  rejected  by  the  data? 

31.  Married  Women  Labor  Supply.  Mroz  (1987)  questions  the  exogeneity  assumption  of  the  wife’s 
wage  rate  in  a  simple  specification  of  married  women  labor  supply.  Using  the  PSID  for  1975,  his 
sample  consists  of  753  married  white  women  between  the  ages  of  30  and  60  in  1975,  with  428 
working  at  some  time  during  the  year.  The  wife’s  annual  hours  of  work  (hours)  is  regressed  on  the 
logarithm  of  her  wage  rate  (lwage);  the  nonwife  income  (nwifeinc);  the  wife’s  age  (age),  her  years 
of  schooling  (educ),  the  number  of  children  less  than  six  years  old  in  the  household  (kidslt6),  and 
the  number  of  children  between  the  ages  of  five  and  nineteen  (kidsge6).  The  data  set  was  obtained 
from the data web site of Wooldridge (2009).

(a)  Replicate  Table  III  of  Mroz  (1987,  p.  769)  which  gives  the  descriptive  statistics  of  the  data. 

(b)  Replicate  Table  IV  of  Mroz  (1987,  p.  770)  which  runs  OLS  and  2SLS  using  a  variety  of 
instrumental  variables  for  lwage.  These  instrumental  variables  are  described  in  Table  V  of 
Mroz  (1987,  p.  771). 

(c) Run the over-identification test for each 2SLS regression in (b), as well as the diagnostics for
the  first  stage  regression. 

Appendix: The Identification Problem Revisited: The Rank
Condition of Identification

In  section  11.1.2,  we  developed  a  necessary  but  not  sufficient  condition  for  identification.  In 
this  section  we  emphasize  that  model  identification  is  crucial  because  only  then  can  we  get 
meaningful estimates of the parameters. For an under-identified model, different sets of parameter
values agree well with the statistical evidence. As Bekker and Wansbeek (2001, p. 144)
put  it,  preference  for  one  set  of  parameter  values  over  other  ones  becomes  arbitrary.  Therefore, 
“Scientific  conclusions  drawn  on  the  basis  of  such  arbitrariness  are  in  the  best  case  void  and  in 
the  worst  case  dangerous.”  Manski  (1995,  p.  6)  also  warns  that  “negative  identification  findings 
imply  that  statistical  inference  is  fruitless.  It  makes  no  sense  to  try  to  use  a  sample  of  finite  size 
to  infer  something  that  could  not  be  learned  even  if  a  sample  of  infinite  size  were  available.” 

Consider the simultaneous equation model

$$By_t + \Gamma x_t = u_t \qquad t = 1, 2, \ldots, T \tag{A.1}$$

which displays the whole set of equations at time $t$. $B$ is $G \times G$, $\Gamma$ is $G \times K$ and $u_t$ is $G \times 1$. $B$ is square and nonsingular indicating that the system is complete, i.e., there are as many equations as there are endogenous variables. Premultiplying (A.1) by $B^{-1}$ and solving for $y_t$ in terms of the exogenous variables and the vector of errors, we get

$$y_t = \Pi x_t + v_t \qquad t = 1, 2, \ldots, T \tag{A.2}$$

where $\Pi = -B^{-1}\Gamma$ is $G \times K$, and $v_t = B^{-1}u_t$. Note that if we premultiply the structural model in (A.1) by an arbitrary nonsingular $G \times G$ matrix $F$, then the new structural model has the same reduced form given in (A.2). In this case, each new structural equation is a linear combination of the original structural equations, but the reduced form equations are the same. One idea for identification, explored by Fisher (1966), is to note that (A.1) is completely defined by $B$, $\Gamma$ and the probability density function of the disturbances $p(u_t)$. The specification of the structural model, which comes from economic theory, imposes a lot of zero restrictions on the $B$ and $\Gamma$ coefficients. In addition, there may be cross-equation or within-equation restrictions, for example, constant returns to scale of the production function, or homogeneity of a demand equation, or symmetry conditions. In addition, the probability density function of the disturbances may itself contain some zero covariance restrictions. The structural model given in (A.1) is identified if these restrictions are enough to distinguish it from any other structural model. This is operationalized by proving that the only nonsingular matrix $F$ which results in a new structural model that satisfies the same restrictions on the original model is the identity matrix. If after imposing the restrictions, only certain rows of $F$ resemble the corresponding rows of an identity matrix, up to a scalar of normalization, then the corresponding equations of the system are identified. The remaining equations are not identified. This is the same concept of taking a linear combination of the demand and supply equations and seeing if the linear combination is different from demand or supply. If it is, then both equations are identified. If it looks like demand but not supply, then supply is identified and demand is not identified. Let us look at an example.

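Example (A.1): Consider the demand and supply equations

$$Q_t = a - bP_t + cY_t + u_{1t} \tag{A.3}$$
$$Q_t = d + eP_t + fW_t + u_{2t} \tag{A.4}$$

where $Y$ is income and $W$ is weather. Writing (A.3) and (A.4) in matrix form (A.1), we get

$$B = \begin{bmatrix} 1 & b \\ 1 & -e \end{bmatrix} \quad \Gamma = \begin{bmatrix} -a & -c & 0 \\ -d & 0 & -f \end{bmatrix} \quad y_t = \begin{bmatrix} Q_t \\ P_t \end{bmatrix} \quad x_t' = [1, Y_t, W_t] \quad u_t' = (u_{1t}, u_{2t}) \tag{A.5}$$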

There are two zero restrictions on $\Gamma$. The first is that income does not appear in the supply equation and the second is that weather does not appear in the demand equation. Therefore, the order condition of identification is satisfied for both equations. In fact, for each equation, there is one excluded exogenous variable and only one right hand side included endogenous variable. Therefore, both equations are just-identified. Let $F = [f_{ij}]$ for $i, j = 1, 2$, be a nonsingular matrix. Premultiply this system by $F$. The new matrix $B$ is now $FB$ and the new matrix $\Gamma$ is now $F\Gamma$. In order for the transformed system to satisfy the same restrictions as the original model, $FB$ must satisfy the following normalization restrictions:

$$f_{11} + f_{12} = 1 \qquad f_{21} + f_{22} = 1 \tag{A.6}$$

Also, $F\Gamma$ should satisfy the following zero restrictions

$$-f_{21}c + f_{22}0 = 0 \qquad f_{11}0 - f_{12}f = 0 \tag{A.7}$$

Since $c \neq 0$ and $f \neq 0$, then (A.7) implies that $f_{21} = f_{12} = 0$. Using (A.6), we get $f_{11} = f_{22} = 1$. Hence, the only nonsingular $F$ that satisfies the same restrictions on the original model is the identity matrix, provided $c \neq 0$ and $f \neq 0$. Therefore, both equations are identified.
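
The argument of Example (A.1) can be traced numerically. The sketch below uses hypothetical values for the structural coefficients and an arbitrary nonsingular matrix `F_any`. It first confirms that premultiplying by `F_any` leaves the reduced form unchanged, and then solves the restrictions (A.6) and (A.7) row by row to show that they force $F$ to be the identity matrix.

```python
import numpy as np

# Hypothetical structural coefficients for (A.3) and (A.4)
a, b, c, d, e, f = 10.0, 0.5, 0.3, 2.0, 0.8, 0.4
B = np.array([[1.0, b], [1.0, -e]])
Gamma = np.array([[-a, -c, 0.0], [-d, 0.0, -f]])

# Any nonsingular F gives the same reduced form Pi = -B^{-1} Gamma
F_any = np.array([[0.7, 0.3], [0.2, 0.8]])
Pi = -np.linalg.solve(B, Gamma)
Pi_F = -np.linalg.solve(F_any @ B, F_any @ Gamma)

# First row (f11, f12): f11 + f12 = 1 from (A.6) and -f12 * f = 0 from (A.7)
row1 = np.linalg.solve(np.array([[1.0, 1.0], [0.0, -f]]), np.array([1.0, 0.0]))
# Second row (f21, f22): f21 + f22 = 1 from (A.6) and -f21 * c = 0 from (A.7)
row2 = np.linalg.solve(np.array([[1.0, 1.0], [-c, 0.0]]), np.array([1.0, 0.0]))
F_restricted = np.vstack([row1, row2])

print(np.allclose(Pi, Pi_F))
print(F_restricted)
```

If `c` or `f` is set to zero, the corresponding linear system becomes singular and `np.linalg.solve` fails, which mirrors the loss of identification discussed in the next example.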

Example (A.2): If income does not appear in the demand equation (A.3), i.e., $c = 0$, the model looks like

$$Q_t = a - bP_t + u_{1t} \tag{A.8}$$
$$Q_t = d + eP_t + fW_t + u_{2t} \tag{A.9}$$

In this case, only the second restriction given in (A.7) holds. Therefore, $f \neq 0$ implies $f_{12} = 0$, however $f_{21}$ is not necessarily zero without additional restrictions. Using (A.6), we get $f_{11} = 1$ and $f_{21} + f_{22} = 1$. This means that only the first row of $F$ looks like the first row of an identity matrix, and only the demand equation is identified. In fact, the order condition for identification is not met for the supply equation since there are no excluded exogenous variables from that equation but there is one right hand side included endogenous variable. See problems 6 and 7 for more examples of this method of identification.

Example  (A. 3):  Suppose  that  u  ~  (0,H)  where  H  ®  It,  and  £  =  [0^]  for  i,j  =  1,2.  This 
example  shows  how  a  variance-covariance  restriction  can  help  identify  an  equation.  Let  us  take 
the  model  defined  in  (A. 8),  (A. 9)  and  add  the  restriction  that  0  12  =  021  =  0.  In  this  case,  the 
transformed  model  disturbances  will  be  Fu  ~  (0,  Tl*),  where  Q*  =  £*  <S>  It,  and  £*  =  FTjF' .  In 
fact,  since  £  is  diagonal,  FT,F ’  should  also  be  diagonal.  This  imposes  the  following  restriction 
on  the  elements  of  F: 

/11C11/21  +  /12022/22  =  0  (A. 10) 

But,  fu  =  1  and  /12  =  0  from  the  zero  restrictions  imposed  on  the  demand  equation,  see 
example  4.  Hence,  (A. 10)  reduces  to  011/21  =  0.  Since  011  fy  0,  this  implies  that  /21  =  0, 
and  the  normalization  restriction  given  in  (A. 6),  implies  that  fi2  =  1-  Therefore,  the  second 
equation  is  also  identified. 
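
To see where the restriction in (A.10) comes from, it may help to write out the off-diagonal element of the transformed covariance matrix. The following is only a worked expansion under the stated restriction that the two structural disturbances are uncorrelated; it introduces no new assumptions.

```latex
\[
(F\Sigma F')_{12}
  = \sum_{k=1}^{2}\sum_{l=1}^{2} f_{1k}\,\sigma_{kl}\,f_{2l}
  = f_{11}\sigma_{11}f_{21} + f_{12}\sigma_{22}f_{22}
  \qquad \text{when } \sigma_{12} = \sigma_{21} = 0 .
\]
```

Requiring this element to be zero, so that the transformed model satisfies the same covariance restriction as the original one, gives (A.10).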

Example (A.4): In this example, we demonstrate how cross-equation restrictions can help identify
equations. Consider the following simultaneous model

$y_1 = a + by_2 + cx_1 + u_1$  (A.11)

$y_2 = d + ey_1 + fx_1 + gx_2 + u_2$  (A.12)

and add the restriction $c = f$. It can be easily shown that the first equation is identified with
$f_{11} = 1$ and $f_{12} = 0$. The second equation has no zero restrictions, but the cross-equation
restriction $c = f$ implies:

$-cf_{11} - ff_{12} = -cf_{21} - ff_{22}$

Using $c = f$, we get

$f_{11} + f_{12} = f_{21} + f_{22}$  (A.13)

But the first equation is identified with $f_{11} = 1$ and $f_{12} = 0$. Hence, (A.13) reduces to $f_{21} + f_{22} = 1$,
which together with the normalization condition $-f_{21}b + f_{22} = 1$ gives $f_{21}(b + 1) = 0$. If
$b \neq -1$, then $f_{21} = 0$ and $f_{22} = 1$. The second equation is also identified provided $b \neq -1$.

Alternatively, one can look at the problem of identification by asking whether the structural
parameters in $B$ and $\Gamma$ can be obtained from the reduced form parameters. It will be clear
from the following discussion that this task is impossible if there are no restrictions on this
simultaneous model. In this case, the model is hopelessly unidentified. However, in the usual
case where there are a lot of zeros in $B$ and $\Gamma$, we may be able to retrieve the remaining non-zero
coefficients from $\Pi$. More rigorously, $\Pi = -B^{-1}\Gamma$, which implies that

$B\Pi + \Gamma = 0$  (A.14)

or

$AW = 0$ where $A = [B, \Gamma]$ and $W' = [\Pi', I_K]$  (A.15)

For the first equation, this implies that

$\alpha_1'W = 0$ where $\alpha_1'$ is the first row of $A$.  (A.16)

$W$ is known (or can be estimated) and is of rank $K$. If the first equation has no restrictions on its
structural parameters, then $\alpha_1'$ contains $(G + K)$ unknown coefficients. These coefficients satisfy
$K$ homogeneous equations given in (A.16). Without further restrictions, we cannot solve for
$(G + K)$ coefficients $(\alpha_1')$ with only $K$ equations. Let $\phi$ denote the matrix of $R$ zero restrictions
on the first equation, i.e., $\alpha_1'\phi = 0$. This together with (A.16) implies that

$\alpha_1'[W, \phi] = 0$  (A.17)

and we can solve uniquely for $\alpha_1'$ (up to a scalar of normalization) provided the

rank $[W, \phi] = G + K - 1$  (A.18)

Economists specify each structural equation with the left hand side endogenous variable having
the coefficient one. This normalization identifies one coefficient of $\alpha_1'$, therefore, we require only
$(G + K - 1)$ more restrictions to uniquely identify the remaining coefficients of $\alpha_1$. $[W, \phi]$ is a
$(G + K) \times (K + R)$ matrix. Its rank cannot exceed either of its two dimensions, i.e.,
$(K + R) \geq (G + K - 1)$, which results in $R \geq G - 1$, or the order condition of identification. Note that
this is a necessary but not sufficient condition for (A.18) to hold. It states that the number
of restrictions on the first equation must be at least as large as the number of endogenous variables
minus one. If all $R$ restrictions are zero restrictions, then it means that the number of excluded
exogenous plus the number of excluded endogenous variables should be at least $(G - 1)$.
But the $G$ endogenous variables are made up of the left hand side endogenous variable $y_1$, the
$g_1$ right hand side included endogenous variables $Y_1$, and $(G - g_1 - 1)$ excluded endogenous
variables. Therefore, $R \geq (G - 1)$ can be written as $k_2 + (G - g_1 - 1) \geq (G - 1)$ which reduces
to $k_2 \geq g_1$, which was discussed earlier in this chapter.

The necessary and sufficient condition for identification can now be obtained as follows: Using
(A.1) one can write

$Az_t = u_t$ where $z_t' = (y_t', x_t')$  (A.19)




and from the first definition of identification we make the transformation $FAz_t = Fu_t$, where
$F$ is a $G \times G$ nonsingular matrix. The first equation satisfies the restrictions $\alpha_1'\phi = 0$, which
can be rewritten as $\iota'A\phi = 0$, where $\iota'$ is the first row of an identity matrix $I_G$. $F$ must satisfy
the restriction that (first row of $FA$)$\phi = 0$. But the first row of $FA$ is the first row of $F$, say
$f_1'$, times $A$. This means that $f_1'(A\phi) = 0$. For the first equation to be identified, this condition
on the transformed first equation must be equivalent to $\iota'A\phi = 0$, up to a scalar constant. This
holds if and only if $f_1'$ is a scalar multiple of $\iota'$, and the latter condition holds if and only if the
rank $(A\phi) = G - 1$. The latter is known as the rank condition for identification.

Example (A.5): Consider the simple Keynesian model given in example 1. The second equation
is an identity and the first equation satisfies the order condition of identification, since $I_t$ is the
excluded exogenous variable from that equation, and there is only one right hand side included
endogenous variable $Y_t$. In fact, the first equation is just-identified. Note that

$A = [B, \Gamma] = \begin{bmatrix} \beta_{11} & \beta_{12} & \gamma_{11} & \gamma_{12} \\ \beta_{21} & \beta_{22} & \gamma_{21} & \gamma_{22} \end{bmatrix} = \begin{bmatrix} 1 & -\beta & -\alpha & 0 \\ -1 & 1 & 0 & -1 \end{bmatrix}$  (A.20)

and $\phi$ for the first equation consists of only one restriction, namely that $I_t$ is not in that
equation, or $\gamma_{12} = 0$. This makes $\phi' = (0, 0, 0, 1)$, since $\alpha_1'\phi = 0$ gives $\gamma_{12} = 0$. From (A.20),
$A\phi = (\gamma_{12}, \gamma_{22})' = (0, -1)'$ and the rank $(A\phi) = 1 = G - 1$. Hence, the rank condition holds
for the first equation and it is identified. Problem 8 reconsiders example (A.1), where both
equations are just-identified by the order condition of identification and asks the reader to show
that both satisfy the rank condition of identification as long as $c \neq 0$ and $f \neq 0$.
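
The rank condition of Example (A.5) can also be checked numerically. The sketch below is only illustrative: the values chosen for the two structural parameters are arbitrary assumptions, and the column ordering (consumption, income, constant, investment) is the one implied by (A.20).

```python
import numpy as np

# Illustrative structural values (assumed, not taken from the text).
alpha, beta = 10.0, 0.8

# A = [B, Gamma] as in (A.20); columns are (C_t, Y_t, constant, I_t).
A = np.array([
    [1.0, -beta, -alpha, 0.0],   # consumption equation
    [-1.0, 1.0, 0.0, -1.0],      # income identity
])

# phi picks out the single zero restriction on the first equation:
# investment I_t is excluded, so the coefficient gamma_12 is zero.
phi = np.array([[0.0], [0.0], [0.0], [1.0]])

G = A.shape[0]                               # number of endogenous variables
A_phi = A @ phi                              # equals (gamma_12, gamma_22)'
rank_A_phi = int(np.linalg.matrix_rank(A_phi))
rank_condition_holds = bool(rank_A_phi == G - 1)

print(A_phi.ravel(), rank_A_phi, rank_condition_holds)
```

Any other admissible values of the two parameters give the same rank, since the product only picks out the last column of the matrix in (A.20).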

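The reduced form of the simple Keynesian model is given in equations (11.4) and (11.5). In
fact,

$\Pi = \begin{bmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \end{bmatrix} = \begin{bmatrix} \alpha & \beta \\ \alpha & 1 \end{bmatrix} / (1 - \beta)$  (A.21)

Note that

$\pi_{11}/\pi_{22} = \alpha$ and $\pi_{21}/\pi_{22} = \alpha$
$\pi_{12}/\pi_{22} = \beta$ and $(\pi_{22} - 1)/\pi_{22} = \beta$  (A.22)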


Therefore, the structural parameters of the consumption equation can be retrieved from the
reduced form coefficients. However, what happens if we replace $\Pi$ by its OLS estimate $\hat{\Pi}_{OLS}$?
Would the solution in (A.22) lead to two estimates of $(\alpha, \beta)$ or would this solution lead to a
unique estimate? In this case, the consumption equation is just-identified and the solution in
(A.22) is unique. To show this, recall that


$\hat{\pi}_{12} = m_{ci}/m_{ii}$ and $\hat{\pi}_{22} = m_{yi}/m_{ii}$  (A.23)

Solving for $\hat{\beta}$, using (A.22), one gets

$\hat{\beta} = \hat{\pi}_{12}/\hat{\pi}_{22} = m_{ci}/m_{yi}$  (A.24)

and

$\hat{\beta} = (\hat{\pi}_{22} - 1)/\hat{\pi}_{22} = (m_{yi} - m_{ii})/m_{yi}$  (A.25)

(A.24) and (A.25) lead to a unique solution because equation (11.2) gives

$m_{yi} = m_{ci} + m_{ii}$  (A.26)


In  general,  we  would  not  be  able  to  solve  for  the  structural  parameters  of  an  unidentified 
equation  in  terms  of  the  reduced  form  parameters.  However,  when  this  equation  is  identified, 
replacing  the  reduced  form  parameters  by  their  OLS  estimates  would  lead  to  a  unique  estimate 
of  the  structural  parameters,  only  if  this  equation  is  just-identified,  and  to  more  than  one 
estimate  depending  upon  the  degree  of  over-identification.  Problem  8  gives  another  example  of 
the  just-identified  case,  while  problem  9  considers  a  model  with  one  unidentified  and  another 
over-identified  equation. 
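
The uniqueness argument for the just-identified consumption equation can be illustrated with simulated data. The sketch below assumes arbitrary structural values and normally distributed investment and disturbances (none of which come from the text), generates data that satisfy the income identity, and computes the slope both ways, as in (A.24) and (A.25).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
alpha, beta = 10.0, 0.8                  # assumed structural values

invest = rng.normal(20.0, 5.0, T)        # exogenous I_t (assumed distribution)
u = rng.normal(0.0, 1.0, T)              # consumption disturbance

# Solve the two structural equations for the endogenous variables.
income = (alpha + invest + u) / (1.0 - beta)
consumption = income - invest            # identity Y_t = C_t + I_t


def cross_moment(a, b):
    """Sum of cross-products in deviations from sample means."""
    return float(np.sum((a - a.mean()) * (b - b.mean())))


m_ci = cross_moment(consumption, invest)
m_yi = cross_moment(income, invest)
m_ii = cross_moment(invest, invest)

pi12_hat = m_ci / m_ii                   # (A.23)
pi22_hat = m_yi / m_ii

beta_from_A24 = pi12_hat / pi22_hat
beta_from_A25 = (pi22_hat - 1.0) / pi22_hat

print(beta_from_A24, beta_from_A25)
```

The two numbers coincide up to rounding error because the identity forces the moment relation in (A.26) to hold in every sample, not only in expectation.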

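Example (A.6): Equations (11.13) and (11.14) give an unidentified demand and supply model
with

$B = \begin{bmatrix} 1 & -\beta \\ 1 & -\delta \end{bmatrix}$ and $\Gamma = \begin{bmatrix} -\alpha \\ -\gamma \end{bmatrix}$  (A.27)

The reduced form equations given by (11.16) and (11.17) yield

$\Pi = -B^{-1}\Gamma = \begin{bmatrix} \pi_{11} \\ \pi_{21} \end{bmatrix} = \begin{bmatrix} \alpha\delta - \gamma\beta \\ \alpha - \gamma \end{bmatrix} / (\delta - \beta)$  (A.28)

Note that one cannot solve for $(\alpha, \beta)$ nor $(\gamma, \delta)$ in terms of $(\pi_{11}, \pi_{21})$ without further restrictions.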
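
A small numerical sketch may make the lack of identification in Example (A.6) concrete. The parameter values and the mixing weight below are arbitrary assumptions. The point is that a weighted combination of the two equations, with weights summing to one so that the coefficient on quantity stays normalized, produces a different "demand" equation with exactly the same reduced form.

```python
import numpy as np


def reduced_form(alpha, beta, gamma, delta):
    """Pi = -B^{-1} Gamma for the system in (A.27), endogenous vector (Q_t, P_t)."""
    B = np.array([[1.0, -beta], [1.0, -delta]])
    Gamma = np.array([[-alpha], [-gamma]])
    return -np.linalg.solve(B, Gamma)


# Assumed structural values: downward sloping demand, upward sloping supply.
alpha, beta, gamma, delta = 10.0, -1.0, 2.0, 1.0
Pi = reduced_form(alpha, beta, gamma, delta)

# Observationally equivalent "demand": (1 - w) * demand + w * supply.
w = 0.3                                   # hypothetical mixing weight
alpha_mix = (1 - w) * alpha + w * gamma
beta_mix = (1 - w) * beta + w * delta
Pi_mix = reduced_form(alpha_mix, beta_mix, gamma, delta)

print(Pi.ravel(), Pi_mix.ravel())
```

Both structures imply the same equilibrium quantity and price, so data on these two variables alone cannot distinguish them.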


CHAPTER  12 

Pooling  Time-Series  of  Cross-Section  Data 

12.1  Introduction 

In this chapter, we will consider pooling time-series of cross-sections. This may be a panel
of households or firms or simply countries or states followed over time. Two well known examples
of panel data in the U.S. are the Panel Study of Income Dynamics (PSID) and the
National Longitudinal Survey (NLS). The PSID began in 1968 with 4802 families, including
an over-sampling of poor households. Annual interviews were conducted and socioeconomic
characteristics of each of the families and of roughly 31000 individuals who have been in these
or derivative families were recorded. The list of variables collected is over 5000. The NLS
followed five distinct segments of the labor force. The original samples include 5020 older men,
5225 young men, 5083 mature women, 5159 young women and 12686 youths. There was an
over-sampling of blacks, hispanics, poor whites and military in the youths survey. The list of
variables collected runs into the thousands. An inventory of national studies using panel data is
given at http://www.isr.umich.edu/src/psid/panelstudies.html. Pooling this data gives a richer
source of variation which allows for more efficient estimation of the parameters. With additional,
more informative data, one can get more reliable estimates and test more sophisticated behavioral
models with less restrictive assumptions. Another advantage of panel data sets is their
ability to control for individual heterogeneity. Not controlling for these unobserved individual
specific effects leads to bias in the resulting estimates. Panel data sets are also better able to
identify and estimate effects that are simply not detectable in pure cross-sections or pure time-series
data. In particular, panel data sets are better able to study complex issues of dynamic
behavior. For example, with a cross-section data set one can estimate the rate of unemployment
at a particular point in time. Repeated cross-sections can show how this proportion changes
over time. Only panel data sets can estimate what proportion of those who are unemployed in
one period remain unemployed in another period. Some of the benefits and limitations of using
panel data sets are listed in Hsiao (2003) and Baltagi (2008).

Section 12.2 studies the error components model focusing on fixed effects, random effects
and maximum likelihood estimation. Section 12.3 considers the question of prediction in a
random effects model, while Section 12.4 illustrates the estimation methods using an empirical
example. Section 12.5 considers testing the poolability assumption, the existence of random
individual effects and the consistency of the random effects estimator using a Hausman test.
Section 12.6 studies the dynamic panel data model and illustrates the methods used with an
empirical example. Section 12.7 concludes with a short presentation of program evaluation and
the difference-in-differences estimator.


12.2  The  Error  Components  Model 

The  regression  model  is  still  the  same,  but  it  now  has  double  subscripts 


$y_{it} = \alpha + X_{it}'\beta + u_{it}$  (12.1)

where $i$ denotes cross-sections and $t$ denotes time-periods with $i = 1, 2, \ldots, N$, and $t = 1, 2, \ldots, T$.
$\alpha$ is a scalar, $\beta$ is $K \times 1$ and $X_{it}$ is the $it$-th observation on $K$ explanatory variables.
The observations are usually stacked with $i$ being the slower index, i.e., the $T$ observations on
the first household followed by the $T$ observations on the second household, and so on, until we
get to the $N$-th household. Under the error components specification, the disturbances take the
form

$u_{it} = \mu_i + \nu_{it}$  (12.2)

where the $\mu_i$'s are cross-section specific components and $\nu_{it}$ are remainder effects. For example,
$\mu_i$ may denote individual ability in an earnings equation, or managerial skill in a production
function or simply a country specific effect. These effects are time-invariant.

In  vector  form,  (12.1)  can  be  written  as 

$y = \alpha\iota_{NT} + X\beta + u = Z\delta + u$  (12.3)

where $y$ is $NT \times 1$, $X$ is $NT \times K$, $Z = [\iota_{NT}, X]$, $\delta' = (\alpha', \beta')$, and $\iota_{NT}$ is a vector of ones of
dimension $NT$. Also, (12.2) can be written as

$u = Z_\mu\mu + \nu$  (12.4)

where $u' = (u_{11}, \ldots, u_{1T}, u_{21}, \ldots, u_{2T}, \ldots, u_{N1}, \ldots, u_{NT})$ and $Z_\mu = I_N \otimes \iota_T$. $I_N$ is an identity
matrix of dimension $N$, $\iota_T$ is a vector of ones of dimension $T$, and $\otimes$ denotes Kronecker product
defined in the Appendix to Chapter 7. $Z_\mu$ is a selector matrix of ones and zeros, or simply the matrix
of individual dummies that one may include in the regression to estimate the $\mu_i$'s if they are
assumed to be fixed parameters. $\mu' = (\mu_1, \ldots, \mu_N)$ and $\nu' = (\nu_{11}, \ldots, \nu_{1T}, \ldots, \nu_{N1}, \ldots, \nu_{NT})$.
Note that $Z_\mu Z_\mu' = I_N \otimes J_T$ where $J_T$ is a matrix of ones of dimension $T$, and $P = Z_\mu(Z_\mu'Z_\mu)^{-1}Z_\mu'$,
the projection matrix on $Z_\mu$, reduces to $P = I_N \otimes \bar{J}_T$ where $\bar{J}_T = J_T/T$. $P$ is a matrix which averages
the observation across time for each individual, and $Q = I_{NT} - P$ is a matrix which obtains
the deviations from individual means. For example, $Pu$ has a typical element $\bar{u}_{i.} = \sum_{t=1}^{T} u_{it}/T$
repeated $T$ times for each individual and $Qu$ has a typical element $(u_{it} - \bar{u}_{i.})$. $P$ and $Q$ are (i)
symmetric idempotent matrices, i.e., $P' = P$ and $P^2 = P$. This means that the rank $(P)$ =
tr$(P) = N$ and rank $(Q)$ = tr$(Q) = N(T - 1)$. This uses the result that rank of an idempotent
matrix is equal to its trace, see Graybill (1961, Theorem 1.63) and the Appendix to Chapter
7. Also, (ii) $P$ and $Q$ are orthogonal, i.e., $PQ = 0$ and (iii) they sum to the identity matrix
$P + Q = I_{NT}$. In fact, any two of these properties imply the third, see Graybill (1961, Theorem
1.68).
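
The properties of the averaging and deviation matrices listed above are easy to confirm for a small panel. The sketch below builds the dummy matrix with a Kronecker product for arbitrarily chosen dimensions (three individuals, four periods, both assumptions made for illustration) and stores the checks in named variables.

```python
import numpy as np

N, T = 3, 4                                   # small illustrative panel
iota_T = np.ones((T, 1))
Z_mu = np.kron(np.eye(N), iota_T)             # NT x N matrix of individual dummies
J_bar_T = np.ones((T, T)) / T
P = np.kron(np.eye(N), J_bar_T)               # averages over time for each individual
Q = np.eye(N * T) - P                         # deviations from individual means

# P is the projection matrix on the column space of Z_mu.
P_from_projection = Z_mu @ np.linalg.inv(Z_mu.T @ Z_mu) @ Z_mu.T

# A stacked vector with i as the slow index, as in the text.
u = np.arange(1.0, N * T + 1.0)
individual_means = u.reshape(N, T).mean(axis=1)

checks = {
    "P is the projection on Z_mu": np.allclose(P, P_from_projection),
    "P idempotent": np.allclose(P @ P, P),
    "Q idempotent": np.allclose(Q @ Q, Q),
    "PQ = 0": np.allclose(P @ Q, 0.0),
    "rank(P) = tr(P) = N": np.linalg.matrix_rank(P) == N and np.isclose(np.trace(P), N),
    "rank(Q) = tr(Q) = N(T-1)": np.linalg.matrix_rank(Q) == N * (T - 1)
    and np.isclose(np.trace(Q), N * (T - 1)),
    "Pu repeats individual means": np.allclose(P @ u, np.repeat(individual_means, T)),
}
print(checks)
```

In applied work one would not form these matrices explicitly for large panels; computing group means directly gives the same transformed data at far lower cost.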

12.2.1  The  Fixed  Effects  Model 

If the $\mu_i$'s are thought of as fixed parameters to be estimated, then equation (12.1) becomes

$y_{it} = \alpha + X_{it}'\beta + \sum_{i=1}^{N} \mu_i D_i + \nu_{it}$  (12.5)

where $D_i$ is a dummy variable for the $i$-th household. Not all the dummies are included so as
not to fall in the dummy variable trap. One is usually dropped or equivalently, we can say that
there is a restriction on the $\mu$'s given by $\sum_{i=1}^{N} \mu_i = 0$. The $\nu_{it}$'s are the usual classical IID random
variables with 0 mean and variance $\sigma_\nu^2$. OLS on equation (12.5) is BLUE, but we have two
problems. The first is the loss of degrees of freedom since in this case, we are estimating $N + K$
parameters. Also, with a lot of dummies we could be running into multicollinearity problems
and a large $X'X$ matrix to invert. For example, if $N = 50$ states, $T = 10$ years and we have
two explanatory variables, then with 500 observations we are estimating 52 parameters. Alternatively,
we can think of this in an analysis of variance context and rearrange our observations,
say, on $y$ in an $(N \times T)$ matrix where rows denote firms and columns denote time periods.


                         t
           1        2      ...      T
     1    y_11     y_12    ...     y_1T      y_1.
 i   2    y_21     y_22    ...     y_2T      y_2.
     :     :        :               :         :
     N    y_N1     y_N2    ...     y_NT      y_N.

where $y_{i.} = \sum_{t=1}^{T} y_{it}$ and $\bar{y}_{i.} = y_{i.}/T$. For the simple regression with one regressor, the model
given in (12.1) becomes

$y_{it} = \alpha + \beta x_{it} + \mu_i + \nu_{it}$  (12.6)

averaging over time gives

$\bar{y}_{i.} = \alpha + \beta\bar{x}_{i.} + \mu_i + \bar{\nu}_{i.}$  (12.7)

and averaging over all observations gives

$\bar{y}_{..} = \alpha + \beta\bar{x}_{..} + \bar{\nu}_{..}$  (12.8)

where $\bar{y}_{..} = \sum_{i=1}^{N}\sum_{t=1}^{T} y_{it}/NT$. Equation (12.8) follows because the $\mu_i$'s sum to zero. Defining
$\tilde{y}_{it} = (y_{it} - \bar{y}_{i.})$ and $\tilde{x}_{it}$ and $\tilde{\nu}_{it}$ similarly, we get

$y_{it} - \bar{y}_{i.} = \beta(x_{it} - \bar{x}_{i.}) + (\nu_{it} - \bar{\nu}_{i.})$

or

$\tilde{y}_{it} = \beta\tilde{x}_{it} + \tilde{\nu}_{it}$  (12.9)

Running OLS on equation (12.9) leads to the same estimator of $\beta$ as that obtained from equation
(12.5). This is called the least squares dummy variable estimator (LSDV) or $\tilde{\beta}$ in our notation.
It is also known as the Within estimator since $\sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{x}_{it}^2$ is the within sum of squares in an
analysis of variance framework. One can then retrieve an estimate of $\alpha$ from equation (12.8) as
$\tilde{\alpha} = \bar{y}_{..} - \tilde{\beta}\bar{x}_{..}$. Similarly, if we are interested in the $\mu_i$'s, those can also be retrieved from (12.7)
and (12.8) as follows:

$\tilde{\mu}_i = (\bar{y}_{i.} - \bar{y}_{..}) - \tilde{\beta}(\bar{x}_{i.} - \bar{x}_{..})$  (12.10)
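
The equivalence between the Within estimator and the dummy variable regression can be verified on simulated data. The following sketch uses an arbitrary data generating process with one regressor (all numerical values are assumptions made for illustration) and compares the slope from (12.9) with the slope from a regression on the regressor and a full set of individual dummies.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 5, 6
alpha, beta = 2.0, 1.5                        # assumed true values
mu = rng.normal(size=N)
mu -= mu.mean()                               # impose the restriction sum(mu_i) = 0

# Regressor correlated with the individual effects, stored as an N x T array.
x = rng.normal(size=(N, T)) + mu[:, None]
y = alpha + beta * x + mu[:, None] + rng.normal(scale=0.5, size=(N, T))

# Within estimator from (12.9): OLS on deviations from individual means.
x_dev = x - x.mean(axis=1, keepdims=True)
y_dev = y - y.mean(axis=1, keepdims=True)
beta_within = float((x_dev * y_dev).sum() / (x_dev ** 2).sum())

# Recover the intercept and the individual effects as in (12.8) and (12.10).
alpha_hat = float(y.mean() - beta_within * x.mean())
mu_hat = (y.mean(axis=1) - y.mean()) - beta_within * (x.mean(axis=1) - x.mean())

# LSDV: regress y on x and N individual dummies (intercept absorbed by the dummies).
D = np.kron(np.eye(N), np.ones((T, 1)))
X_lsdv = np.column_stack([x.reshape(-1), D])
coef, *_ = np.linalg.lstsq(X_lsdv, y.reshape(-1), rcond=None)
beta_lsdv = float(coef[0])

print(beta_within, beta_lsdv)
```

Here the dummy coefficients estimate the sum of the intercept and each individual effect, which is why they match the sum of the two recovered pieces.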

In  matrix  form,  one  can  substitute  the  disturbances  given  by  (12.4)  into  (12.3)  to  get 

$y = \alpha\iota_{NT} + X\beta + Z_\mu\mu + \nu = Z\delta + Z_\mu\mu + \nu$  (12.11)

and then perform OLS on (12.11) to get estimates of $\alpha$, $\beta$ and $\mu$. Note that $Z$ is $NT \times (K + 1)$
and $Z_\mu$, the matrix of individual dummies, is $NT \times N$. If $N$ is large, (12.11) will include too
many individual dummies, and the matrix to be inverted by OLS is large and of dimension
$(N + K)$. In fact, since $\alpha$ and $\beta$ are the parameters of interest, one can obtain the least squares
dummy variables (LSDV) estimator from (12.11), by residualing out the $Z_\mu$ variables, i.e., by
premultiplying the model by $Q$, the orthogonal projection of $Z_\mu$, and performing OLS

$Qy = QX\beta + Q\nu$  (12.12)

This uses the fact that $QZ_\mu = Q\iota_{NT} = 0$, since $PZ_\mu = Z_\mu$. In other words, the $Q$ matrix
wipes out the individual effects. Recall the FWL Theorem in Chapter 7. This is a regression
of $\tilde{y} = Qy$ with typical element $(y_{it} - \bar{y}_{i.})$ on $\tilde{X} = QX$ with typical element $(X_{it,k} - \bar{X}_{i.,k})$ for
the $k$-th regressor, $k = 1, 2, \ldots, K$. This involves the inversion of a $(K \times K)$ matrix rather than
$(N + K) \times (N + K)$ as in (12.11). The resulting OLS estimator is

$\tilde{\beta} = (X'QX)^{-1}X'Qy$  (12.13)

with var$(\tilde{\beta}) = \sigma_\nu^2(X'QX)^{-1} = \sigma_\nu^2(\tilde{X}'\tilde{X})^{-1}$.

Note  that  this  fixed  effects  (FE)  estimator  cannot  estimate  the  effect  of  any  time-invariant 
variable  like  sex,  race,  religion,  schooling,  or  union  participation.  These  time-invariant  variables 
are wiped out by the $Q$ transformation, the deviations from means transformation. Alternatively,
one can see that these time-invariant variables are spanned by the individual dummies
in (12.5) and therefore any regression package attempting (12.5) will fail, signaling perfect
multicollinearity. If (12.5) is the true model, LSDV is BLUE as long as $\nu_{it}$ is the standard classical
disturbance with mean 0 and variance covariance matrix $\sigma_\nu^2 I_{NT}$. Note that as $T \to \infty$, the FE
estimator is consistent. However, if $T$ is fixed and $N \to \infty$ as typical in short labor panels, then
only the FE estimator of $\beta$ is consistent; the FE estimators of the individual effects $(\alpha + \mu_i)$
are not consistent since the number of these parameters increases as $N$ increases.


Testing for Fixed Effects: One could test the joint significance of these dummies, i.e., $H_0$:
$\mu_1 = \mu_2 = \cdots = \mu_{N-1} = 0$, by performing an F-test. This is a simple Chow test given in (4.17)
with the restricted residual sums of squares (RRSS) being that of OLS on the pooled model
and the unrestricted residual sums of squares (URSS) being that of the LSDV regression. If $N$
is large, one can perform the within transformation and use that residual sum of squares as the
URSS. In this case

$F_0 = \dfrac{(RRSS - URSS)/(N - 1)}{URSS/(NT - N - K)} \overset{H_0}{\sim} F_{N-1,\,N(T-1)-K}$  (12.14)
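
As a worked illustration of (12.14), the statistic and its degrees of freedom can be computed with a few lines of arithmetic. The function name and the residual sums of squares below are hypothetical; only the panel dimensions echo the earlier example with 50 states, 10 years and two regressors.

```python
def fixed_effects_f_stat(rrss, urss, n_individuals, n_periods, n_regressors):
    """Return (F0, df1, df2) for H0: all individual effects are zero, as in (12.14).

    rrss: residual sum of squares from pooled OLS (restricted model).
    urss: residual sum of squares from the LSDV or Within regression.
    """
    df1 = n_individuals - 1
    df2 = n_individuals * (n_periods - 1) - n_regressors
    f0 = ((rrss - urss) / df1) / (urss / df2)
    return f0, df1, df2


# Hypothetical residual sums of squares, chosen only for illustration.
f0, df1, df2 = fixed_effects_f_stat(
    rrss=120.0, urss=80.0, n_individuals=50, n_periods=10, n_regressors=2
)
print(round(f0, 4), df1, df2)
```

The computed value would then be compared with a critical value of the F distribution with the reported degrees of freedom.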


Computational Warning: One computational caution is in order for those using the Within regression given
by (12.12). The $s^2$ of this regression as obtained from a typical regression package divides the
residual sums of squares by $NT - K$ since the intercept and the dummies are not included.
The proper $s^2$, say $s^{*2}$ from the LSDV regression in (12.5), would divide the same residual sums
of squares by $N(T - 1) - K$. Therefore, one has to adjust the variances obtained from the
within regression (12.12) by multiplying the variance-covariance matrix by $(s^{*2}/s^2)$ or simply
by multiplying by $[NT - K]/[N(T - 1) - K]$.
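
The degrees of freedom correction can be checked on simulated data. The sketch below assumes an arbitrary data generating process with two regressors. It runs the Within regression "by hand", computes the uncorrected and corrected variance estimates, and confirms that the Within residual sum of squares equals the one from the full dummy variable regression.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, K = 8, 5, 2
X = rng.normal(size=(N * T, K))
mu = np.repeat(rng.normal(size=N), T)          # individual effects, i is the slow index
y = 1.0 + X @ np.array([0.5, -1.0]) + mu + rng.normal(size=N * T)


def demean_by_individual(a):
    """Subtract individual (over time) means from stacked data with i as slow index."""
    a3 = a.reshape(N, T, -1)
    return (a3 - a3.mean(axis=1, keepdims=True)).reshape(N * T, -1)


# Within regression (12.12).
X_w = demean_by_individual(X)
y_w = demean_by_individual(y).ravel()
beta_w, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
rss_within = float(np.sum((y_w - X_w @ beta_w) ** 2))

s2_package = rss_within / (N * T - K)          # what a generic OLS routine would report
s2_lsdv = rss_within / (N * (T - 1) - K)       # proper divisor from the LSDV regression
correction = (N * T - K) / (N * (T - 1) - K)

# LSDV regression on X and N dummies, for comparison.
D = np.kron(np.eye(N), np.ones((T, 1)))
Z_full = np.column_stack([X, D])
coef, *_ = np.linalg.lstsq(Z_full, y, rcond=None)
rss_lsdv = float(np.sum((y - Z_full @ coef) ** 2))

print(s2_package, s2_lsdv, correction)
```

Multiplying the variance-covariance matrix reported for the Within regression by the correction factor then reproduces the LSDV standard errors.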


12.2.2  The  Random  Effects  Model 

There are too many parameters in the fixed effects model and the loss of degrees of freedom
can be avoided if the $\mu_i$'s can be assumed random. In this case $\mu_i \sim$ IID$(0, \sigma_\mu^2)$, $\nu_{it} \sim$ IID$(0, \sigma_\nu^2)$
and the $\mu_i$'s are independent of the $\nu_{it}$'s. In addition, the $X_{it}$'s are independent of the $\mu_i$'s and
$\nu_{it}$'s for all $i$ and $t$. The random effects model is an appropriate specification if we are drawing
$N$ individuals randomly from a large population.

This specification implies a homoskedastic variance var$(u_{it}) = \sigma_\mu^2 + \sigma_\nu^2$ for all $i$ and $t$, and
an equi-correlated block-diagonal covariance matrix which exhibits serial correlation over time
only between the disturbances of the same individual. In fact,

cov$(u_{it}, u_{js}) = \sigma_\mu^2 + \sigma_\nu^2$ for $i = j$, $t = s$  (12.15)
cov$(u_{it}, u_{js}) = \sigma_\mu^2$ for $i = j$, $t \neq s$

and zero otherwise. This also means that the correlation coefficient between $u_{it}$ and $u_{js}$ is

$\rho$ = correl$(u_{it}, u_{js}) = 1$ for $i = j$, $t = s$  (12.16)
$\rho$ = correl$(u_{it}, u_{js}) = \sigma_\mu^2/(\sigma_\mu^2 + \sigma_\nu^2)$ for $i = j$, $t \neq s$

and zero otherwise. From (12.4), one can compute the variance-covariance matrix

$\Omega = E(uu') = Z_\mu E(\mu\mu')Z_\mu' + E(\nu\nu') = \sigma_\mu^2(I_N \otimes J_T) + \sigma_\nu^2(I_N \otimes I_T)$  (12.17)

In order to obtain the GLS estimator of the regression coefficients, we need $\Omega^{-1}$. This is a huge
matrix for typical panels and is of dimension $(NT \times NT)$. No brute force inversion should be
attempted even if the researcher's application has a small $N$ and $T$. For example, if we observe
$N = 20$ firms over $T = 5$ time periods, $\Omega$ will be 100 by 100. We will follow a simple trick devised
by Wansbeek and Kapteyn (1982) that allows the derivation of $\Omega^{-1}$ and $\Omega^{-1/2}$. Essentially, one
replaces $J_T$ by $T\bar{J}_T$, and $I_T$ by $(E_T + \bar{J}_T)$ where $E_T$ is by definition $(I_T - \bar{J}_T)$. In this case:

$\Omega = T\sigma_\mu^2(I_N \otimes \bar{J}_T) + \sigma_\nu^2(I_N \otimes E_T) + \sigma_\nu^2(I_N \otimes \bar{J}_T)$

collecting terms with the same matrices, we get

$\Omega = (T\sigma_\mu^2 + \sigma_\nu^2)(I_N \otimes \bar{J}_T) + \sigma_\nu^2(I_N \otimes E_T) = \sigma_1^2 P + \sigma_\nu^2 Q$  (12.18)

where $\sigma_1^2 = T\sigma_\mu^2 + \sigma_\nu^2$. (12.18) is the spectral decomposition representation of $\Omega$, with $\sigma_1^2$
being the first unique characteristic root of $\Omega$ of multiplicity $N$ and $\sigma_\nu^2$ the second unique
characteristic root of $\Omega$ of multiplicity $N(T - 1)$. It is easy to verify, using the properties of $P$
and $Q$, that

$\Omega^{-1} = \dfrac{1}{\sigma_1^2}P + \dfrac{1}{\sigma_\nu^2}Q$  (12.19)

and

$\Omega^{-1/2} = \dfrac{1}{\sigma_1}P + \dfrac{1}{\sigma_\nu}Q$  (12.20)

In fact, $\Omega^r = (\sigma_1^2)^r P + (\sigma_\nu^2)^r Q$ where $r$ is an arbitrary scalar. Now we can obtain GLS as
a weighted least squares. Fuller and Battese (1974) suggested premultiplying the regression
equation given in (12.3) by $\sigma_\nu\Omega^{-1/2} = Q + (\sigma_\nu/\sigma_1)P$ and performing OLS on the resulting
transformed regression. In this case, $y^* = \sigma_\nu\Omega^{-1/2}y$ has a typical element $y_{it} - \theta\bar{y}_{i.}$ where
$\theta = 1 - (\sigma_\nu/\sigma_1)$. This transformed regression inverts a matrix of dimension $(K + 1)$ and can be
easily implemented using any regression package.
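
The spectral decomposition results can be confirmed numerically for a small panel. In the sketch below the two variance components are arbitrary assumed values. It builds the covariance matrix from (12.17), compares it with the decomposition in (12.18), checks the inverse and inverse square root, and shows that the Fuller and Battese transformation amounts to subtracting a fraction of the individual mean.

```python
import numpy as np

N, T = 4, 3
sigma2_mu, sigma2_nu = 2.0, 0.5                  # assumed variance components
sigma2_1 = T * sigma2_mu + sigma2_nu

P = np.kron(np.eye(N), np.ones((T, T)) / T)
Q = np.eye(N * T) - P

# Omega from (12.17) and its spectral decomposition from (12.18).
Omega = sigma2_mu * np.kron(np.eye(N), np.ones((T, T))) + sigma2_nu * np.eye(N * T)
Omega_spectral = sigma2_1 * P + sigma2_nu * Q

Omega_inv = P / sigma2_1 + Q / sigma2_nu                          # (12.19)
Omega_inv_half = P / np.sqrt(sigma2_1) + Q / np.sqrt(sigma2_nu)   # (12.20)

# Fuller-Battese transformation applied to an arbitrary stacked vector.
theta = 1.0 - np.sqrt(sigma2_nu / sigma2_1)
rng = np.random.default_rng(3)
y = rng.normal(size=N * T)
y_star = np.sqrt(sigma2_nu) * (Omega_inv_half @ y)
y_quasi_demeaned = y - theta * np.repeat(y.reshape(N, T).mean(axis=1), T)

eigenvalues = np.sort(np.linalg.eigvalsh(Omega))
print(theta, eigenvalues)
```

The sorted characteristic roots show the smaller root repeated once for every within degree of freedom and the larger root repeated once per individual, as the multiplicities in the text state.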
The Best Quadratic Unbiased (BQU) estimators of the variance components arise naturally
from the spectral decomposition of $\Omega$. In fact, $Pu \sim (0, \sigma_1^2 P)$ and $Qu \sim (0, \sigma_\nu^2 Q)$ and

$\hat{\sigma}_1^2 = \dfrac{u'Pu}{\text{tr}(P)} = T\sum_{i=1}^{N}\bar{u}_{i.}^2/N$  (12.21)

and

$\hat{\sigma}_\nu^2 = \dfrac{u'Qu}{\text{tr}(Q)} = \sum_{i=1}^{N}\sum_{t=1}^{T}(u_{it} - \bar{u}_{i.})^2/N(T - 1)$  (12.22)

provide the BQU estimators of $\sigma_1^2$ and $\sigma_\nu^2$, respectively, see Balestra (1973).

These are analysis of variance type estimators of the variance components and are MVU under
normality of the disturbances, see Graybill (1961). The true disturbances are not known and
therefore (12.21) and (12.22) are not feasible. Wallace and Hussain (1969) suggest substituting
OLS residuals $\hat{u}_{OLS}$ instead of the true $u$'s. After all, the OLS estimates are still unbiased and
consistent, but no longer efficient. Amemiya (1971) shows that these estimators of the variance
components have a different asymptotic distribution from that knowing the true disturbances.
He suggests using the LSDV residuals instead of the OLS residuals. In this case $\tilde{u} = y - \tilde{\alpha}\iota_{NT} - X\tilde{\beta}$
where $\tilde{\alpha} = \bar{y}_{..} - \bar{X}_{..}'\tilde{\beta}$ and $\bar{X}_{..}'$ is a $1 \times K$ vector of averages of all regressors. Substituting these
$\tilde{u}$'s for $u$ in (12.21) and (12.22) we get the Amemiya-type estimators of the variance components.
The resulting estimates of the variance components have the same asymptotic distribution as
that knowing the true disturbances.

Swamy and Arora (1972) suggest running two regressions to get estimates of the variance
components from the corresponding mean square errors of these regressions. The first regression
is the Within regression, given in (12.12), which yields the following $s^2$:

$\hat{\sigma}_\nu^2 = [y'Qy - y'QX(X'QX)^{-1}X'Qy]/[N(T - 1) - K]$  (12.23)

The second regression is the Between regression which runs the regression of averages across
time, i.e.,

$\bar{y}_{i.} = \alpha + \bar{X}_{i.}'\beta + \bar{u}_{i.}$  $i = 1, \ldots, N$  (12.24)

This is equivalent to premultiplying the model in (12.11) by $P$ and running OLS. The only
caution is that the latter regression has $NT$ observations because it repeats the averages $T$ times
for each individual, while the cross-section regression in (12.24) is based on $N$ observations. To
remedy this, one can run the cross-section regression

$y_{i.}/\sqrt{T} = \alpha(\sqrt{T}) + (X_{i.}'/\sqrt{T})\beta + u_{i.}/\sqrt{T}$  (12.25)

where one can easily verify that var$(u_{i.}/\sqrt{T}) = \sigma_1^2$. This regression will yield an $s^2$ given by

$\hat{\sigma}_1^2 = (y'Py - y'PZ(Z'PZ)^{-1}Z'Py)/(N - K - 1)$  (12.26)
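
The two mean square errors in (12.23) and (12.26) can be computed directly from the quadratic forms. The sketch below simulates a random effects panel under assumed variance components (the true values, the sample sizes and the helper name are all choices made for illustration) and then backs out the estimate of the individual effects variance.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, K = 200, 5, 2
sigma2_mu_true, sigma2_nu_true = 1.0, 0.49       # assumed values used to simulate

X = rng.normal(size=(N * T, K))
mu = np.repeat(rng.normal(scale=np.sqrt(sigma2_mu_true), size=N), T)
nu = rng.normal(scale=np.sqrt(sigma2_nu_true), size=N * T)
y = 1.0 + X @ np.array([0.5, -1.0]) + mu + nu

P = np.kron(np.eye(N), np.ones((T, T)) / T)
Q = np.eye(N * T) - P
Z = np.column_stack([np.ones(N * T), X])


def projected_rss(M, y, W):
    """Compute y'My - y'MW (W'MW)^{-1} W'My for a symmetric idempotent M."""
    My = M @ y
    WtMy = W.T @ My
    return float(y @ My - WtMy @ np.linalg.solve(W.T @ M @ W, WtMy))


sigma2_nu_hat = projected_rss(Q, y, X) / (N * (T - 1) - K)   # Within, (12.23)
sigma2_1_hat = projected_rss(P, y, Z) / (N - K - 1)          # Between, (12.26)
sigma2_mu_hat = (sigma2_1_hat - sigma2_nu_hat) / T

print(sigma2_nu_hat, sigma2_1_hat, sigma2_mu_hat)
```

With this many individuals the estimates should land near the values used in the simulation, although nothing guarantees a positive estimate of the individual effects variance in small samples.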


Note that stacking the following two transformed regressions we just performed yields

$\begin{pmatrix} Qy \\ Py \end{pmatrix} = \begin{pmatrix} QZ \\ PZ \end{pmatrix}\delta + \begin{pmatrix} Qu \\ Pu \end{pmatrix}$  (12.27)

and the transformed error has mean 0 and variance-covariance matrix given by

$\begin{pmatrix} \sigma_\nu^2 Q & 0 \\ 0 & \sigma_1^2 P \end{pmatrix}$

Problem 6 asks the reader to verify that OLS on this system of $2NT$ observations yields OLS
on the pooled model (12.3). Also, GLS on this system yields GLS on (12.3). Alternatively, one
could get rid of the constant $\alpha$ by running the following stacked regressions:

$\begin{pmatrix} Qy \\ (P - \bar{J}_{NT})y \end{pmatrix} = \begin{pmatrix} QX \\ (P - \bar{J}_{NT})X \end{pmatrix}\beta + \begin{pmatrix} Qu \\ (P - \bar{J}_{NT})u \end{pmatrix}$  (12.28)

This follows from the fact that $Q\iota_{NT} = 0$ and $(P - \bar{J}_{NT})\iota_{NT} = 0$. The transformed error has
zero mean and variance-covariance matrix

$\begin{pmatrix} \sigma_\nu^2 Q & 0 \\ 0 & \sigma_1^2(P - \bar{J}_{NT}) \end{pmatrix}$  (12.29)

OLS on this system yields OLS on (12.3) and GLS on (12.28) yields GLS on (12.3). In fact,

$\hat{\beta}_{GLS} = [(X'QX/\sigma_\nu^2) + X'(P - \bar{J}_{NT})X/\sigma_1^2]^{-1}[(X'Qy/\sigma_\nu^2) + (X'(P - \bar{J}_{NT})y/\sigma_1^2)]$
$\hat{\beta}_{GLS} = [W_{XX} + \phi^2 B_{XX}]^{-1}[W_{Xy} + \phi^2 B_{Xy}]$  (12.30)

with var$(\hat{\beta}_{GLS}) = \sigma_\nu^2[W_{XX} + \phi^2 B_{XX}]^{-1}$. Note that $W_{XX} = X'QX$, $B_{XX} = X'(P - \bar{J}_{NT})X$
and $\phi^2 = \sigma_\nu^2/\sigma_1^2$. Also, the Within estimator of $\beta$ is $\tilde{\beta}_{Within} = W_{XX}^{-1}W_{Xy}$ and the Between
estimator is $\hat{\beta}_{Between} = B_{XX}^{-1}B_{Xy}$. This shows that $\hat{\beta}_{GLS}$ is a matrix weighted average of $\tilde{\beta}_{Within}$
and $\hat{\beta}_{Between}$ weighing each estimate by the inverse of its corresponding variance. In fact

$\hat{\beta}_{GLS} = W_1\tilde{\beta}_{Within} + W_2\hat{\beta}_{Between}$  (12.31)

where $W_1 = [W_{XX} + \phi^2 B_{XX}]^{-1}W_{XX}$ and $W_2 = [W_{XX} + \phi^2 B_{XX}]^{-1}(\phi^2 B_{XX}) = I - W_1$. This
was demonstrated by Maddala (1971). Note that (i) if $\sigma_\mu^2 = 0$, then $\phi^2 = 1$ and $\hat{\beta}_{GLS}$ reduces
to $\hat{\beta}_{OLS}$. (ii) If $T \to \infty$, then $\phi^2 \to 0$ and $\hat{\beta}_{GLS}$ tends to $\tilde{\beta}_{Within}$. (iii) If $\phi^2 \to \infty$, then $\hat{\beta}_{GLS}$
tends to $\hat{\beta}_{Between}$. In other words, the Within estimator ignores the between variation, and
the Between estimator ignores the within variation. The OLS estimator gives equal weight to
the between and within variations. From (12.30), it is clear that var$(\tilde{\beta}_{Within}) -$ var$(\hat{\beta}_{GLS})$ is a
positive semi-definite matrix, since $\phi^2$ is positive. However as $T \to \infty$ for any fixed $N$, $\phi^2 \to 0$
and both $\hat{\beta}_{GLS}$ and $\tilde{\beta}_{Within}$ have the same asymptotic variance.

Another estimator of the variance components was suggested by Nerlove (1971). His suggestion
is to estimate $\sigma_\mu^2$ as $\sum_{i=1}^{N}(\hat{\mu}_i - \bar{\hat{\mu}})^2/(N - 1)$ where $\hat{\mu}_i$ are the dummy coefficients estimates
from the LSDV regression. $\sigma_\nu^2$ is estimated from the within residual sums of squares divided by
$NT$ without correction for degrees of freedom.

Note that, except for Nerlove's (1971) method, one has to retrieve $\hat{\sigma}_\mu^2$ as $(\hat{\sigma}_1^2 - \hat{\sigma}_\nu^2)/T$. In this
case, there is no guarantee that the estimate of $\sigma_\mu^2$ would be non-negative. Searle (1971) has
an extensive discussion of the problem of negative estimates of the variance components in the
biometrics literature. One solution is to replace these negative estimates by zero. This in fact is
the suggestion of the Monte Carlo study by Maddala and Mount (1973). This study finds that
negative estimates occurred only when the true $\sigma_\mu^2$ was small and close to zero. In these cases
OLS is still a viable estimator. Therefore, replacing negative $\hat{\sigma}_\mu^2$ by zero is not a bad sin after
all, and the problem is dismissed as not being serious.

Under  the  random  effects  model,  GLS  based  on  the  true  variance  components  is  BLUE,  and 
all  the  feasible  GLS  estimators  considered  are  asymptotically  efficient  as  either  N  or  T  — »  oo. 
Maddala  and  Mount  (1973)  compared  OLS,  Within,  Between,  feasible  GLS  methods,  true  GLS 
and  MLE  using  their  Monte  Carlo  study.  They  found  little  to  choose  among  the  various  feasible 
GLS  estimators  in  small  samples  and  argued  in  favor  of  methods  that  were  easier  to  compute. 

Taylor  (1980)  derived  exact  finite  sample  results  for  the  one-way  error  components  model.  He 
compared the Within estimator with the Swamy-Arora feasible GLS estimator. He found the
following important results: (1) Feasible GLS is more efficient than FE for all but the fewest
degrees of freedom. (2) The variance of feasible GLS is never more than 17% above the
Cramér-Rao lower bound. (3) More efficient estimators of the variance components do not necessarily
yield  more  efficient  feasible  GLS  estimators.  These  finite  sample  results  are  confirmed  by  the 
Monte  Carlo  experiments  carried  out  by  Maddala  and  Mount  (1973)  and  Baltagi  (1981). 


12.2.3  Maximum  Likelihood  Estimation 


Under  normality  of  the  disturbances,  one  can  write  the  log-likelihood  function  as 


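$$L(\alpha, \beta, \phi^2, \sigma_\nu^2) = \text{constant} - \frac{NT}{2}\log\sigma_\nu^2 + \frac{N}{2}\log\phi^2 - \frac{1}{2\sigma_\nu^2}\,u'\Sigma^{-1}u \tag{12.32}$$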


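where $\Omega = \sigma_\nu^2\Sigma$, $\phi^2 = \sigma_\nu^2/\sigma_1^2$ and $\Sigma = Q + \phi^{-2}P$ from (12.18). This uses the fact that $|\Omega|$ =
product of its characteristic roots = $(\sigma_\nu^2)^{N(T-1)}(\sigma_1^2)^{N} = (\sigma_\nu^2)^{NT}(\phi^2)^{-N}$. Note that there is a
one-to-one correspondence between $\phi^2$ and $\sigma_\mu^2$. In fact, $0 \le \sigma_\mu^2 < \infty$ translates into $0 < \phi^2 \le 1$.
Brute force maximization of (12.32) leads to nonlinear first-order conditions, see Amemiya
(1971). Instead, Breusch (1987) concentrates the likelihood with respect to $\alpha$ and $\sigma_\nu^2$. In this
case, $\hat\alpha_{MLE} = \bar y_{..} - \bar X_{..}'\hat\beta_{MLE}$ and $\hat\sigma^2_{\nu,MLE} = \hat u'\hat\Sigma^{-1}\hat u/NT$ where $\hat u$ and $\hat\Sigma$ are based on MLE's of
$\beta$, $\phi^2$ and $\alpha$. Let $d = y - X\hat\beta_{MLE}$, then $\hat\alpha_{MLE} = \iota_{NT}'d/NT$ and $\hat u = d - \iota_{NT}\hat\alpha_{MLE} = d - \bar J_{NT}d$. This
implies that $\hat\sigma^2_{\nu,MLE}$ can be rewritten as

$$\hat\sigma^2_{\nu,MLE} = d'[Q + \phi^2(P - \bar J_{NT})]d/NT \tag{12.33}$$

and the concentrated log-likelihood becomes

$$L_C(\beta, \phi^2) = \text{constant} - \frac{NT}{2}\log\{d'[Q + \phi^2(P - \bar J_{NT})]d\} + \frac{N}{2}\log\phi^2 \tag{12.34}$$

Maximizing (12.34) over $\phi^2$, given $\beta$, yields

$$\hat\phi^2 = \frac{d'Qd}{(T-1)\,d'(P - \bar J_{NT})d} = \frac{\sum_{i=1}^{N}\sum_{t=1}^{T}(d_{it} - \bar d_{i.})^2}{T(T-1)\sum_{i=1}^{N}(\bar d_{i.} - \bar d_{..})^2} \tag{12.35}$$

Maximizing (12.34) over $\beta$, given $\phi^2$, yields

$$\hat\beta_{MLE} = \{X'[Q + \phi^2(P - \bar J_{NT})]X\}^{-1}X'[Q + \phi^2(P - \bar J_{NT})]y \tag{12.36}$$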

One can iterate between $\beta$ and $\phi^2$ until convergence. Breusch (1987) shows that provided $T > 1$,
any $i$-th iteration $\beta$, call it $\beta_i$, gives $0 < \phi^2_{i+1} < \infty$ in the $(i+1)$-th iteration. More importantly,
Breusch (1987) shows that these $\phi_i^2$'s have a remarkable property of forming a monotonic sequence.
In fact, starting from the Within estimator of $\beta$, for $\phi^2 = 0$, the next $\phi^2$ is finite and
positive and starts a monotonically increasing sequence of $\phi^2$'s. Similarly, starting from the
Between estimator of $\beta$, for ($\phi^2 \to \infty$) the next $\phi^2$ is finite and positive and starts a monotonically
decreasing sequence of $\phi^2$'s. Hence, to guard against the possibility of a local maximum,
Breusch (1987) suggests starting with $\hat\beta_{Within}$ and $\hat\beta_{Between}$ and iterating. If these two sequences
converge to the same maximum, then this is the global maximum. If one starts with $\hat\beta_{OLS}$ for
$\phi^2 = 1$, and the next iteration obtains a larger $\phi^2$, then we have a local maximum at the boundary
$\phi^2 = 1$. Maddala (1971) finds that there are at most two maxima for the likelihood $L(\phi^2)$
for $0 < \phi^2 \le 1$. Hence, we have to guard against one local maximum.
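The $\phi^2$ half of this iteration only needs the residuals $d_{it} = y_{it} - x_{it}'\beta$ for the current $\beta$. The sketch below implements the scalar form of the update (within sum of squares over $T(T-1)$ times the between sum of squares of the individual means). The function name and the list-of-lists layout are assumptions made for illustration; the $\beta$ step (a weighted least squares regression) is not shown.

```python
def phi2_update(d):
    """One phi^2 update given residuals d[i][t] (N individuals, T periods).

    Within variation of d divided by T(T-1) times the between
    variation of the individual means around the grand mean.
    """
    n_units, n_periods = len(d), len(d[0])
    unit_means = [sum(row) / n_periods for row in d]
    grand_mean = sum(unit_means) / n_units
    within = sum((x - m) ** 2 for row, m in zip(d, unit_means) for x in row)
    between = sum((m - grand_mean) ** 2 for m in unit_means)
    return within / (n_periods * (n_periods - 1) * between)


# Toy residuals for N = 2, T = 2
phi2 = phi2_update([[1.0, 3.0], [5.0, 7.0]])
```

In a full implementation one would alternate this step with the $\beta$ update, starting once from the Within and once from the Between estimates, and compare the two limits as the text recommends.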


12.3  Prediction 


Suppose  we  want  to  predict  S  periods  ahead  for  the  i-th  individual.  For  the  random  effects 
model,  the  BLU  estimator  is  GLS.  Using  the  results  in  Chapter  9  on  GLS,  Goldberger’s  (1962) 
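Best Linear Unbiased Predictor (BLUP) of $y_{i,T+S}$ is

$$\hat y_{i,T+S} = Z_{i,T+S}'\hat\delta_{GLS} + w'\Omega^{-1}\hat u_{GLS} \quad \text{for } S \ge 1 \tag{12.37}$$

where $\hat u_{GLS} = y - Z\hat\delta_{GLS}$ and $w = E(u_{i,T+S}\,u)$. Note that

$$u_{i,T+S} = \mu_i + \nu_{i,T+S} \tag{12.38}$$

and $w = \sigma_\mu^2(\ell_i \otimes \iota_T)$ where $\ell_i$ is the $i$-th column of $I_N$, i.e., $\ell_i$ is a vector that has 1 in the $i$-th
position and zero elsewhere. In this case

$$w'\Omega^{-1} = \sigma_\mu^2(\ell_i' \otimes \iota_T')\left[\frac{1}{\sigma_1^2}P + \frac{1}{\sigma_\nu^2}Q\right] = \frac{\sigma_\mu^2}{\sigma_1^2}(\ell_i' \otimes \iota_T') \tag{12.39}$$

since $(\ell_i' \otimes \iota_T')P = (\ell_i' \otimes \iota_T')$ and $(\ell_i' \otimes \iota_T')Q = 0$. The typical element of $w'\Omega^{-1}\hat u_{GLS}$ is
$(T\sigma_\mu^2/\sigma_1^2)\,\bar{\hat u}_{i.,GLS}$ where $\bar{\hat u}_{i.,GLS} = \sum_{t=1}^{T}\hat u_{it,GLS}/T$. Therefore, in (12.37), the BLUP for $y_{i,T+S}$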
corrects  the  GLS  prediction  by  a  fraction  of  the  mean  of  the  GLS  residuals  corresponding  to 
that  i-th  individual.  This  predictor  was  considered  by  Wansbeek  and  Kapteyn  (1978)  and  Taub 
(1979). 
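As a rough numerical sketch of this predictor, the function below adds the fraction $T\sigma_\mu^2/\sigma_1^2$ of the individual's mean GLS residual to the plain GLS prediction. It assumes $\sigma_1^2 = T\sigma_\mu^2 + \sigma_\nu^2$ as in the one-way error components model; the function and argument names are hypothetical.

```python
def blup_forecast(z_future, delta_gls, gls_resid_i, sigma_mu2, sigma_nu2):
    """BLUP of y_{i,T+S} for individual i (illustrative sketch).

    z_future    : regressors (including the constant) for period T+S
    delta_gls   : GLS coefficient estimates, same length as z_future
    gls_resid_i : the T GLS residuals of individual i
    """
    n_periods = len(gls_resid_i)
    sigma1_sq = n_periods * sigma_mu2 + sigma_nu2   # assumed definition of sigma_1^2
    gls_prediction = sum(z * b for z, b in zip(z_future, delta_gls))
    shrinkage = n_periods * sigma_mu2 / sigma1_sq   # fraction applied to the mean residual
    mean_resid = sum(gls_resid_i) / n_periods
    return gls_prediction + shrinkage * mean_resid


# Toy inputs: GLS prediction 2.5, mean residual 0.3, shrinkage 0.5
forecast = blup_forecast([1.0, 2.0], [0.5, 1.0], [0.2, 0.4], 1.0, 2.0)
```

The shrinkage factor lies between 0 and 1: it approaches 1 as $T\sigma_\mu^2$ dominates $\sigma_\nu^2$, and it is 0 when $\sigma_\mu^2 = 0$, in which case the BLUP collapses to the GLS prediction.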


12.4  Empirical  Example 

Baltagi  and  Griffin  (1983)  considered  the  following  gasoline  demand  equation: 

$$\log\frac{Gas}{Car} = \alpha + \beta_1\log\frac{Y}{N} + \beta_2\log\frac{P_{MG}}{P_{GDP}} + \beta_3\log\frac{Car}{N} + u \tag{12.40}$$

where Gas/Car is motor gasoline consumption per auto, Y/N is real income per capita, $P_{MG}/P_{GDP}$
is real motor gasoline price and Car/N denotes the stock of cars per capita. This panel
consists of annual observations across eighteen OECD countries, covering the period 1960-1978.
The data for this example are provided on the Springer web site as GASOLINE.DAT. Table 12.1
gives the Stata output for the Within estimator using xtreg, fe. This is the regression described
in (12.5) and computed as in (12.9). The Within estimator gives a low price elasticity for
gasoline demand of -.322. The F-statistic for the significance of the country effects described in
(12.14) yields an observed value of 83.96. This is distributed under the null as an F(17, 321)
and is statistically significant. This F-statistic is printed by Stata below the fixed effects output.
In EViews, one invokes the test for redundant effects after running the fixed effects regression.


Table 12.1  Fixed Effects Estimator - Gasoline Demand Data

                   Coef.         Std. Err.     t         P>|t|     [95% Conf. Interval]
log(Y/N)           0.6622498     0.073386       9.02     0.000      0.5178715    0.8066282
log(Pmg/Pgdp)     -0.3217025     0.0440992     -7.29     0.000     -0.4084626   -0.2349425
log(Car/N)        -0.6404829     0.0296788    -21.58     0.000     -0.6988725   -0.5820933
Constant           2.40267       0.2253094     10.66     0.000      1.959401     2.84594

sigma_u            0.34841289
sigma_e            0.09233034
rho                0.93438173    (fraction of variance due to u_i)
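The rho line in this Stata output is the share of the total error variance attributed to the individual effects, $\sigma_u^2/(\sigma_u^2 + \sigma_e^2)$ in Stata's notation. A small check using the sigma_u and sigma_e values printed in Table 12.1 (the function name is hypothetical):

```python
def variance_share(sigma_u, sigma_e):
    """Fraction of variance due to u_i: sigma_u^2 / (sigma_u^2 + sigma_e^2)."""
    return sigma_u ** 2 / (sigma_u ** 2 + sigma_e ** 2)


# Standard deviations as printed in the fixed effects output of Table 12.1
rho_fe = variance_share(0.34841289, 0.09233034)
```

The result agrees with the printed rho of 0.934 to the reported precision.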

Table 12.2 gives the Stata output for the Between estimator using xtreg, be. This is based on
the  regression  given  in  (12.24).  The  Between  estimator  yields  a  high  price  elasticity  of  gasoline 
demand  of  -.964.  These  results  were  also  verified  using  TSP. 


Table 12.2  Between Estimator - Gasoline Demand Data

                   Coef.         Std. Err.     t         P>|t|     [95% Conf. Interval]
log(Y/N)           0.9675763     0.1556662      6.22     0.000      0.6337055    1.301447
log(Pmg/Pgdp)     -0.9635503     0.1329214     -7.25     0.000     -1.248638    -0.6784622
log(Car/N)        -0.795299      0.0824742     -9.64     0.000     -0.9721887   -0.6184094
Constant           2.54163       0.5267845      4.82     0.000      1.411789     3.67147

Table 12.3 gives the Stata output for the random effects model using xtreg, re. This is the
Swamy and Arora (1972) estimator which yields a price elasticity of -.420. This is closer to the
Within  estimator  than  the  Between  estimator. 


Table 12.3  Random Effects Estimator - Gasoline Demand Data

                   Coef.         Std. Err.     t         P>|t|     [95% Conf. Interval]
log(Y/N)           0.5549858     0.0591282      9.39     0.000      0.4390967    0.6708749
log(Pmg/Pgdp)     -0.4203893     0.0399781    -10.52     0.000     -0.498745    -0.3420336
log(Car/N)        -0.6068402     0.025515     -23.78     0.000     -0.6568487   -0.5568316
Constant           1.996699      0.184326      10.83     0.000      1.635427     2.357971

sigma_u            0.19554468
sigma_e            0.09233034
rho                0.81769       (fraction of variance due to u_i)

Table 12.4  Gasoline Demand Data. One-way Error Component Results

              β1          β2          β3          ρ
OLS           0.890      -0.892      -0.763       0
             (0.036)*    (0.030)*    (0.019)*
WALHUS        0.545      -0.447      -0.605       0.75
             (0.066)     (0.046)     (0.029)
AMEMIYA       0.602      -0.366      -0.621       0.93
             (0.066)     (0.042)     (0.029)
SWAR          0.555      -0.402      -0.607       0.82
             (0.059)     (0.042)     (0.026)
IMLE          0.588      -0.378      -0.616       0.91
             (0.066)     (0.046)     (0.029)

*  These are biased standard errors when the true model has error component disturbances (see Moulton, 1986).
Source: Baltagi and Griffin (1983). Reproduced by permission of Elsevier Science Publishers B.V. (North-Holland).


Table 12.5  Gasoline Demand Data. Wallace and Hussain (1969) Estimator

Dependent Variable: GAS
Method: Panel EGLS (Cross-section random effects)
Sample: 1960 1978
Periods included: 19
Cross-sections included: 18
Total panel (balanced) observations: 342
Wallace and Hussain estimator of component variances

                   Coefficient    Std. Error    t-Statistic    Prob.
C                   1.938318      0.201817        9.604333     0.0000
log(Y/N)            0.545202      0.065555        8.316682     0.0000
log(Pmg/Pgdp)      -0.447490      0.045763       -9.778438     0.0000
log(Car/N)         -0.605086      0.028838      -20.98191      0.0000

Effects Specification
                                  S.D.          Rho
Cross-section random              0.196715      0.7508
Idiosyncratic random              0.113320      0.2492

Table 12.4 gives the parameter estimates for OLS and three feasible GLS estimates of the slope
coefficients along with their standard errors, and the corresponding estimate of $\rho$ defined in
(12.16). These were obtained using EViews by invoking the random effects estimation on the
individual effects and choosing the estimation method from the options menu. Breusch's (1987)
iterative maximum likelihood was computed using Stata (xtreg, mle) and TSP.

Table 12.5 gives the EViews output for the Wallace and Hussain (1969) random effects estimator,
while Table 12.6 gives the EViews output for the Amemiya (1971) random effects estimator.
Note that EViews calls the Amemiya estimator Wansbeek and Kapteyn (1989) since the latter
paper generalizes this method to deal with unbalanced panels with missing observations, see
Baltagi (2008) for details. Table 12.7 gives the Stata maximum likelihood output.

Table 12.6  Gasoline Demand Data. Wansbeek and Kapteyn (1989) Estimator

Dependent Variable: GAS
Method: Panel EGLS (Cross-section random effects)
Sample: 1960 1978
Periods included: 19
Cross-sections included: 18
Total panel (balanced) observations: 342
Wallace and Hussain estimator of component variances

                   Coefficient    Std. Error    t-Statistic    Prob.
C                   2.188322      0.216372       10.11372      0.0000
log(Y/N)            0.601969      0.065876        9.137941     0.0000
log(Pmg/Pgdp)      -0.365500      0.041620       -8.781832     0.0000
log(Car/N)         -0.620725      0.027356      -22.69053      0.0000

Effects Specification
                                  S.D.          Rho
Cross-section random              0.343826      0.9327
Idiosyncratic random              0.092330      0.0673

Table 12.7  Gasoline Demand Data. Random Effects Maximum Likelihood Estimator

. xtreg c y p car,mle

Random-effects ML regression                    Number of obs      =       342
Group variable (i): coun                        Number of groups   =        18

Random effects u_i ~ Gaussian                   Obs per group: min =        19
                                                               avg =      19.0
                                                               max =        19

                                                LR chi2(3)         =    609.75
Log likelihood = 282.47697                      Prob > chi2        =    0.0000

c                  Coef.         Std. Err.     z         P>|z|     [95% Conf. Interval]
log(Y/N)           .5881334      .0659581       8.92     0.000      .4588578     .717409
log(Pmg/Pgdp)     -.3780466      .0440663      -8.58     0.000     -.464415     -.2916782
log(Car/N)        -.6163722      .0272054     -22.66     0.000     -.6696938    -.5630506
_cons              2.136168      .2156039       9.91     0.000      1.713593     2.558744

sigma_u            .2922939      .0545496                           .2027512     .4213821
sigma_e            .0922537      .0036482                           .0853734     .0996885
rho                .9094086      .0317608                           .8303747     .9571561

Likelihood-ratio test of sigma_u = 0: chibar2(01) = 463.97  Prob >= chibar2 = 0.000

12.5  Testing  in  a  Pooled  Model

(1)  The  Chow-Test 

Before  pooling  the  data  one  may  be  concerned  whether  the  data  is  poolable.  This  hypothesis 
is  also  known  as  the  stability  of  the  regression  equation  across  firms  or  across  time.  It  can  be 
formulated  in  terms  of  an  unrestricted  model  which  involves  a  separate  regression  equation  for 
each  firm 


$$y_i = Z_i\delta_i + u_i \quad \text{for } i = 1, 2, \ldots, N \tag{12.41}$$

where $y_i' = (y_{i1}, \ldots, y_{iT})$, $Z_i = [\iota_T, X_i]$ and $X_i$ is $(T \times K)$. $\delta_i'$ is $1 \times (K+1)$ and $u_i$ is $T \times 1$.
The important thing to notice is that $\delta_i$ is different for every regional equation. We want to
test the hypothesis $H_0$: $\delta_i = \delta$ for all $i$, versus $H_1$: $\delta_i \ne \delta$ for some $i$. Under $H_0$ we can write the
restricted model given in (12.41) as:

$$y = Z\delta + u \tag{12.42}$$


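where $Z' = (Z_1', Z_2', \ldots, Z_N')$ and $u' = (u_1', u_2', \ldots, u_N')$. The unrestricted model can also be
written as

$$y = \begin{pmatrix} Z_1 & 0 & \cdots & 0 \\ 0 & Z_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Z_N \end{pmatrix}\begin{pmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_N \end{pmatrix} + u = Z^*\delta^* + u \tag{12.43}$$

where $\delta^{*\prime} = (\delta_1', \delta_2', \ldots, \delta_N')$ and $Z = Z^*I^*$ with $I^* = (\iota_N \otimes I_{K'})$, an $NK' \times K'$ matrix, with
$K' = K + 1$. Hence the variables in $Z$ are all linear combinations of the variables in $Z^*$. Under
the assumption that $u \sim N(0, \sigma^2 I_{NT})$, the MVU estimator for $\delta$ in equation (12.42) is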


$$\hat\delta_{OLS} = \hat\delta_{MLE} = (Z'Z)^{-1}Z'y \tag{12.44}$$

and therefore

$$y = Z\hat\delta_{OLS} + e \tag{12.45}$$

implying that $e = (I_{NT} - Z(Z'Z)^{-1}Z')y = My = M(Z\delta + u) = Mu$ since $MZ = 0$. Similarly,
under the alternative, the MVU for $\delta_i$ is given by

$$\hat\delta_{i,OLS} = \hat\delta_{i,MLE} = (Z_i'Z_i)^{-1}Z_i'y_i \tag{12.46}$$

and therefore

$$y_i = Z_i\hat\delta_{i,OLS} + e_i \tag{12.47}$$

implying that $e_i = (I_T - Z_i(Z_i'Z_i)^{-1}Z_i')y_i = M_iy_i = M_i(Z_i\delta_i + u_i) = M_iu_i$ since $M_iZ_i = 0$, and
this is true for $i = 1, 2, \ldots, N$. Also, let

$$M^* = I_{NT} - Z^*(Z^{*\prime}Z^*)^{-1}Z^{*\prime} = \begin{pmatrix} M_1 & 0 & \cdots & 0 \\ 0 & M_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & M_N \end{pmatrix}$$

One can easily deduce that $y = Z^*\hat\delta^* + e^*$ with $e^* = M^*y = M^*u$ and $\hat\delta^* = (Z^{*\prime}Z^*)^{-1}Z^{*\prime}y$.
Note that both $M$ and $M^*$ are symmetric and idempotent with $MM^* = M^*$. This easily follows
since

$$Z(Z'Z)^{-1}Z'Z^*(Z^{*\prime}Z^*)^{-1}Z^{*\prime} = Z(Z'Z)^{-1}I^{*\prime}Z^{*\prime}Z^*(Z^{*\prime}Z^*)^{-1}Z^{*\prime} = Z(Z'Z)^{-1}Z'$$

This uses the fact that $Z = Z^*I^*$. Now, $e'e - e^{*\prime}e^* = u'(M - M^*)u$ and $e^{*\prime}e^* = u'M^*u$ are
independent since $(M - M^*)M^* = 0$. Also, both quadratic forms when divided by $\sigma^2$ are
distributed as $\chi^2$ since $(M - M^*)$ and $M^*$ are idempotent, see Judge et al. (1985). Dividing
these quadratic forms by their respective degrees of freedom, and taking their ratio leads to the
following test statistic:

$$F_{obs} = \frac{(e'e - e^{*\prime}e^*)/(\operatorname{tr}(M) - \operatorname{tr}(M^*))}{e^{*\prime}e^*/\operatorname{tr}(M^*)} = \frac{(e'e - e_1'e_1 - e_2'e_2 - \cdots - e_N'e_N)/(N-1)K'}{(e_1'e_1 + e_2'e_2 + \cdots + e_N'e_N)/N(T-K')} \tag{12.48}$$

Under $H_0$, $F_{obs}$ is distributed as an $F((N-1)K', N(T-K'))$, see lemma 2.2 of Fisher (1970).
This is exactly Chow's (1960) test extended to the case of $N$ linear regressions.

The URSS in this case is the sum of the $N$ residual sums of squares obtained by applying
OLS to (12.41), i.e., on each firm equation separately. The RRSS is simply the RSS from OLS
performed on the pooled regression given by (12.42). In this case, there are $(N-1)K'$ restrictions
and the URSS has $N(T-K')$ degrees of freedom. Similarly, one can test the stability of the
regression across time. In this case, the degrees of freedom are $(T-1)K'$ and $T(N-K')$
respectively, since the unrestricted model now consists of $T$ cross-section regressions with $N$
observations each. Both tests target the whole set of regression coefficients including the constant. If
the LSDV model is suspected to be the proper specification, then the intercepts are allowed to
vary but the slopes remain the same. To test the stability of the slopes only, the same Chow-test
can be utilized; however, the RRSS is now that of the LSDV regression with firm (or time)
dummies only. The number of restrictions becomes $(N-1)K$ for testing the stability of the
slopes across firms and $(T-1)K$ for testing their stability across time.
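The degrees-of-freedom bookkeeping is easy to get wrong, so a small sketch may help. The helper names below are hypothetical. The across-time denominator is taken as $T(N-K')$, which is what reproduces the F(72, 266) reported for the gasoline data with $N = 18$, $T = 19$ and $K = 3$.

```python
def poolability_df(n_units, n_periods, k_slopes, across="units", slopes_only=False):
    """Numerator and denominator degrees of freedom for the Chow-type test."""
    k_prime = k_slopes + 1                       # slopes plus intercept
    k_tested = k_slopes if slopes_only else k_prime
    if across == "units":
        # N separate time-series regressions, each with T observations
        return (n_units - 1) * k_tested, n_units * (n_periods - k_prime)
    # T separate cross-section regressions, each with N observations
    return (n_periods - 1) * k_tested, n_periods * (n_units - k_prime)


def chow_f(rrss, urss, df_num, df_den):
    """F = [(RRSS - URSS) / df_num] / [URSS / df_den]."""
    return ((rrss - urss) / df_num) / (urss / df_den)


# Gasoline panel dimensions: N = 18 countries, T = 19 years, K = 3 slopes
df_countries = poolability_df(18, 19, 3)
df_slopes = poolability_df(18, 19, 3, slopes_only=True)
df_time = poolability_df(18, 19, 3, across="time")
```

These three calls give the (68, 270), (51, 270) and (72, 266) degrees of freedom quoted for the gasoline example; `chow_f` then only needs the restricted and unrestricted residual sums of squares.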

The Chow-test however is proper under spherical disturbances, and if that hypothesis is not
correct it will lead to improper inference. Baltagi (1981) showed that if the true specification of
the disturbances is an error components structure then the Chow-test tends to reject poolability
too often when in fact it is true. However, a generalization of the Chow-test which takes care
of the general variance-covariance matrix is available in Zellner (1962). This is exactly the test
of the null hypothesis $H_0$: $R\beta = r$ when $\Omega$ is that of the error components specification, see
Chapter 9. Baltagi (1981) shows that this test performs well in Monte Carlo experiments. In this
case, all we need to do is transform our model (under both the null and alternative hypotheses)
such that the transformed disturbances have a variance of $\sigma^2 I_{NT}$, then apply the Chow-test on
the transformed model. The latter step is legitimate because the transformed disturbances have
homoskedastic variances and the usual Chow-test is legitimate. Given $\Omega = \sigma^2\Sigma$, we premultiply
the restricted model given in (12.42) by $\Sigma^{-1/2}$ and we call $\Sigma^{-1/2}y = \tilde y$, $\Sigma^{-1/2}Z = \tilde Z$ and
$\Sigma^{-1/2}u = \tilde u$. Hence

$$\tilde y = \tilde Z\delta + \tilde u \tag{12.49}$$

with $E(\tilde u\tilde u') = \Sigma^{-1/2}E(uu')\Sigma^{-1/2\prime} = \sigma^2 I_{NT}$. Similarly, we premultiply the unrestricted model
given in (12.43) by $\Sigma^{-1/2}$ and we call $\Sigma^{-1/2}Z^* = \tilde Z^*$. Therefore

$$\tilde y = \tilde Z^*\delta^* + \tilde u \tag{12.50}$$

with $E(\tilde u\tilde u') = \sigma^2 I_{NT}$.

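At this stage, we can test $H_0$: $\delta_i = \delta$ for every $i = 1, 2, \ldots, N$, simply by using the Chow-statistic,
only now on the transformed models (12.49) and (12.50) since they satisfy $\tilde u \sim N(0, \sigma^2 I_{NT})$.
Note that $\tilde Z = \tilde Z^*I^*$ which is simply obtained from $Z = Z^*I^*$ by premultiplying
by $\Sigma^{-1/2}$. Defining $\tilde M = I_{NT} - \tilde Z(\tilde Z'\tilde Z)^{-1}\tilde Z'$, and $\tilde M^* = I_{NT} - \tilde Z^*(\tilde Z^{*\prime}\tilde Z^*)^{-1}\tilde Z^{*\prime}$, it is easy
to show that $\tilde M$ and $\tilde M^*$ are both symmetric, idempotent and such that $\tilde M\tilde M^* = \tilde M^*$. Once
again the conditions for lemma 2.2 of Fisher (1970) are satisfied, and the test-statistic

$$\tilde F_{obs} = \frac{(\tilde e'\tilde e - \tilde e^{*\prime}\tilde e^*)/(\operatorname{tr}(\tilde M) - \operatorname{tr}(\tilde M^*))}{\tilde e^{*\prime}\tilde e^*/\operatorname{tr}(\tilde M^*)} \sim F((N-1)K', N(T-K')) \tag{12.51}$$

where $\tilde e = \tilde y - \tilde Z\tilde\delta_{OLS}$ and $\tilde\delta_{OLS} = (\tilde Z'\tilde Z)^{-1}\tilde Z'\tilde y$ implying that $\tilde e = \tilde M\tilde y = \tilde M\tilde u$. Similarly,
$\tilde e^* = \tilde y - \tilde Z^*\tilde\delta^*_{OLS}$ and $\tilde\delta^*_{OLS} = (\tilde Z^{*\prime}\tilde Z^*)^{-1}\tilde Z^{*\prime}\tilde y$ implying that $\tilde e^* = \tilde M^*\tilde y = \tilde M^*\tilde u$. This is the
Chow-test after premultiplying the model by $\Sigma^{-1/2}$ or simply applying the Fuller and Battese
(1974) transformation. See Baltagi (2008) for details.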

For the gasoline data in Baltagi and Griffin (1983), Chow's test for poolability across countries
yields an observed F-statistic of 129.38 and is distributed as F(68, 270) under $H_0$: $\delta_i = \delta$ for
$i = 1, \ldots, N$. This tests the stability of four time-series regression coefficients across 18 countries.
The unrestricted SSE is based upon 18 OLS time-series regressions, one for each country. For
the stability of the slope coefficients only, $\beta_i = \beta$, an observed F-value of 27.33 is obtained
which is distributed as F(51, 270) under the null. Chow's test for poolability across time yields an
F-value of 0.276 which is distributed as F(72, 266) under $H_0$: $\delta_t = \delta$ for $t = 1, \ldots, T$. This tests
the stability of four cross-section regression coefficients across 19 time periods. The unrestricted
SSE is based upon 19 OLS cross-section regressions, one for each year. This does not reject
poolability across time-periods. The test for poolability across countries, allowing for a one-way
error components model, yields an F-value of 21.64 which is distributed as F(68, 270) under $H_0$:
$\delta_i = \delta$ for $i = 1, \ldots, N$. The test for poolability across time yields an F-value of 1.66 which is
distributed as F(72, 266) under $H_0$: $\delta_t = \delta$ for $t = 1, \ldots, T$. This rejects $H_0$ at the 5% level.


(2)  The  Breusch-Pagan  Test 


Next,  we  look  at  a  Lagrange  Multiplier  test  developed  by  Breusch  and  Pagan  (1980),  which 
tests whether $H_0$: $\sigma_\mu^2 = 0$. The test statistic is given by

$$LM = \frac{NT}{2(T-1)}\left[\frac{\sum_{i=1}^{N}e_{i.}^2}{\sum_{i=1}^{N}\sum_{t=1}^{T}e_{it}^2} - 1\right]^2 \tag{12.52}$$

where $e_{it}$ denotes the OLS residuals on the pooled model and $e_{i.}$ denotes their sum over $t$.
Under the null hypothesis $H_0$ this LM statistic is distributed as a $\chi_1^2$. For the gasoline
data in Baltagi and Griffin (1983), the Breusch and Pagan LM test yields an LM statistic of
1465.6. This is obtained using the Stata command xttest0 after estimating the model with random
effects. This is significant and rejects the null hypothesis. The corresponding likelihood
ratio test assuming Normal disturbances is also reported in the Stata maximum likelihood output
for the random effects model. This yields an LR statistic of 463.97 which is asymptotically
distributed as $\chi_1^2$ under the null hypothesis $H_0$ and is also significant.

One problem with the Breusch-Pagan test is that it assumes that the alternative hypothesis
is two-sided when we know that $\sigma_\mu^2 \ge 0$. A one-sided version of this test is given by Honda
(1985):

$$HO = \sqrt{\frac{NT}{2(T-1)}}\left[\frac{e'(I_N \otimes J_T)e}{e'e} - 1\right] \xrightarrow{H_0} N(0, 1) \tag{12.53}$$

where $e$ denotes the vector of OLS residuals. Note that the square of this $N(0, 1)$ statistic is
the Breusch and Pagan (1980) LM test-statistic. Honda (1985) finds that this test statistic is
uniformly most powerful and robust to non-normality. However, Moulton and Randolph (1989)
showed that the asymptotic $N(0, 1)$ approximation for this one-sided LM statistic can be poor
even in large samples. They suggest an alternative Standardized Lagrange Multiplier (SLM)
test whose asymptotic critical values are generally closer to the exact critical values than those
of the LM test. This SLM test statistic centers and scales the one-sided LM statistic so that its
mean is zero and its variance is one.


$$SLM = \frac{HO - E(HO)}{\sqrt{\operatorname{var}(HO)}} = \frac{d - E(d)}{\sqrt{\operatorname{var}(d)}} \tag{12.54}$$

where $d = e'De/e'e$ and $D = (I_N \otimes J_T)$. Using the results on moments of quadratic forms in
regression residuals, see, e.g., Evans and King (1985), we get

$$E(d) = \operatorname{tr}(D\bar P_Z)/p$$

and

$$\operatorname{var}(d) = 2\{p\operatorname{tr}[(D\bar P_Z)^2] - [\operatorname{tr}(D\bar P_Z)]^2\}/p^2(p+2) \tag{12.55}$$

where $p = n - (K+1)$ and $\bar P_Z = I_n - Z(Z'Z)^{-1}Z'$. Under the null hypothesis, SLM has an
asymptotic $N(0, 1)$ distribution.
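To make the relation between the two-sided LM statistic and Honda's one-sided statistic concrete, the sketch below computes both from a balanced panel of pooled OLS residuals. The function name and the list-of-lists layout are assumptions for illustration, and the two residual sets are toy numbers. The SLM standardization is not shown because it requires the traces in (12.55).

```python
import math


def bp_and_honda(e):
    """Return (LM, HO) from residuals e[i][t], N individuals by T periods."""
    n_units, n_periods = len(e), len(e[0])
    sum_sq_of_sums = sum(sum(row) ** 2 for row in e)   # e'(I_N kron J_T)e
    sum_sq = sum(x * x for row in e for x in row)      # e'e
    scale = n_units * n_periods / (2 * (n_periods - 1))
    ho = math.sqrt(scale) * (sum_sq_of_sums / sum_sq - 1)
    return ho ** 2, ho                                 # LM is the square of HO


# Residuals that move together within each individual (as random effects imply)
lm_pos, ho_pos = bp_and_honda([[1.0, 1.0], [-1.0, -1.0]])
# Residuals that offset each other within each individual
lm_neg, ho_neg = bp_and_honda([[1.0, -1.0], [1.0, -1.0]])
```

Both toy panels give the same LM value, but HO is positive in the first case and negative in the second. Only the first pattern is evidence in the direction $\sigma_\mu^2 > 0$, which is the motivation for the one-sided test.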


(3)  The  Hausman-Test 

A critical assumption in the error components regression model is that $E(u_{it}/X_{it}) = 0$. This is
important given that the disturbances contain individual effects (the $\mu_i$'s) which are unobserved
and may be correlated with the $X_{it}$'s. For example, in an earnings equation these $\mu_i$'s may denote
unobservable ability of the individual and this may be correlated with the schooling variable
included on the right hand side of this equation. In this case, $E(u_{it}/X_{it}) \ne 0$ and the GLS
estimator $\hat\beta_{GLS}$ becomes biased and inconsistent for $\beta$. However, the within transformation
wipes out these $\mu_i$'s and leaves the Within estimator $\tilde\beta_{Within}$ unbiased and consistent for $\beta$.
Hausman (1978) suggests comparing $\hat\beta_{GLS}$ and $\tilde\beta_{Within}$, both of which are consistent under the
null hypothesis $H_0$: $E(u_{it}/X_{it}) = 0$, but which will have different probability limits if $H_0$ is not
true. In fact, $\tilde\beta_{Within}$ is consistent whether $H_0$ is true or not, while $\hat\beta_{GLS}$ is BLUE, consistent and
asymptotically efficient under $H_0$, but is inconsistent when $H_0$ is false. A natural test statistic
would be based on $\hat q = \hat\beta_{GLS} - \tilde\beta_{Within}$. Under $H_0$, plim $\hat q = 0$ and cov($\hat q$, $\hat\beta_{GLS}$) = 0.

Using the fact that $\hat\beta_{GLS} - \beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}u$ and $\tilde\beta_{Within} - \beta = (X'QX)^{-1}X'Qu$,
one gets $E(\hat q) = 0$ and

$$\operatorname{cov}(\hat\beta_{GLS}, \hat q) = \operatorname{var}(\hat\beta_{GLS}) - \operatorname{cov}(\hat\beta_{GLS}, \tilde\beta_{Within}) = (X'\Omega^{-1}X)^{-1} - (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}E(uu')QX(X'QX)^{-1} = 0$$

Using the fact that $\tilde\beta_{Within} = \hat\beta_{GLS} - \hat q$, one gets

$$\operatorname{var}(\tilde\beta_{Within}) = \operatorname{var}(\hat\beta_{GLS}) + \operatorname{var}(\hat q),$$

since cov($\hat\beta_{GLS}$, $\hat q$) = 0. Therefore,

$$\operatorname{var}(\hat q) = \operatorname{var}(\tilde\beta_{Within}) - \operatorname{var}(\hat\beta_{GLS}) = \sigma_\nu^2(X'QX)^{-1} - (X'\Omega^{-1}X)^{-1} \tag{12.56}$$

Hence, the Hausman test statistic is given by

$$m = \hat q'[\operatorname{var}(\hat q)]^{-1}\hat q \tag{12.57}$$

and under $H_0$ is asymptotically distributed as $\chi_K^2$, where $K$ denotes the dimension of slope
vector $\beta$. In order to make this test operational, $\Omega$ is replaced by a consistent estimator $\hat\Omega$, and
GLS by its corresponding FGLS. An alternative asymptotically equivalent test can be obtained
from the augmented regression

$$y^* = X^*\beta + \tilde X\gamma + w \tag{12.58}$$

where $y^* = \sigma_\nu\Omega^{-1/2}y$, $X^* = \sigma_\nu\Omega^{-1/2}X$ and $\tilde X = QX$. Hausman's test is now equivalent to
testing whether $\gamma = 0$. This is a standard Wald test for the omission of the variables $\tilde X$ from
(12.58).

This  test  was  generalized  by  Arellano  (1993)  to  make  it  robust  to  heteroskedasticity  and 
autocorrelation  of  arbitrary  forms.  In  fact,  if  either  heteroskedasticity  or  serial  correlation  is 
present,  the  variances  of  the  Within  and  GLS  estimators  are  not  valid  and  the  corresponding 
Hausman  test  statistic  is  inappropriate.  For  the  Baltagi  and  Griffin  (1983)  gasoline  data,  the 
Hausman  test  statistic  based  on  the  difference  between  the  Within  estimator  and  that  of  feasible 
GLS based on Swamy and Arora (1972) yields a $\chi_3^2$ value of m = 306.1 which rejects the null
hypothesis.  This  is  obtained  using  the  Stata  command  hausman. 
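The quadratic form in (12.57) can be sketched in plain Python once the two slope vectors and their estimated variance matrices are in hand. Everything below is illustrative: the function names are hypothetical, the inputs are toy numbers rather than the gasoline estimates, and the linear system is solved by a small Gaussian elimination instead of an explicit matrix inverse.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def hausman_statistic(b_within, b_gls, var_within, var_gls):
    """m = q' [var(q)]^{-1} q with q = b_gls - b_within and
    var(q) = var(b_within) - var(b_gls), slopes only."""
    q = [g - w for g, w in zip(b_gls, b_within)]
    var_q = [[vw - vg for vw, vg in zip(rw, rg)]
             for rw, rg in zip(var_within, var_gls)]
    return sum(qi * xi for qi, xi in zip(q, solve(var_q, q)))


# Toy inputs with K = 2 slopes
m_stat = hausman_statistic([1.0, 2.0], [1.5, 2.0],
                           [[0.5, 0.0], [0.0, 0.5]],
                           [[0.25, 0.0], [0.0, 0.25]])
```

In finite samples the estimated var($\hat q$) need not be positive definite, so a practical implementation would have to handle that case; this sketch does not.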


12.6  Dynamic  Panel  Data  Models 

The  dynamic  error  components  regression  is  characterized  by  the  presence  of  a  lagged  dependent 
variable  among  the  regressors,  i.e. , 

$$y_{it} = \delta y_{i,t-1} + x_{it}'\beta + \mu_i + \nu_{it}, \quad i = 1, \ldots, N;\ t = 1, \ldots, T \tag{12.59}$$

where $\delta$ is a scalar, $x_{it}'$ is $1 \times K$ and $\beta$ is $K \times 1$. This model has been extensively studied by
Anderson and Hsiao (1982). Since $y_{it}$ is a function of $\mu_i$, $y_{i,t-1}$ is also a function of $\mu_i$. Therefore,
$y_{i,t-1}$, a right hand regressor in (12.59), is correlated with the error term. This renders the
OLS estimator biased and inconsistent even if the $\nu_{it}$'s are not serially correlated. For the FE
estimator, the within transformation wipes out the $\mu_i$'s, but the within-transformed $\tilde y_{i,t-1}$ will still be correlated with
the within-transformed $\tilde\nu_{it}$ even if the $\nu_{it}$'s are not serially correlated. In fact, the Within estimator will be biased of
$O(1/T)$ and its consistency will depend upon $T$ being large, see Nickell (1981). An alternative
transformation that wipes out the individual effects, yet does not create the above problem
is the first difference (FD) transformation. In fact, Anderson and Hsiao (1982) suggested first
differencing the model to get rid of the $\mu_i$'s and then using $\Delta y_{i,t-2} = (y_{i,t-2} - y_{i,t-3})$ or
simply $y_{i,t-2}$ as an instrument for $\Delta y_{i,t-1} = (y_{i,t-1} - y_{i,t-2})$. These instruments will not be
correlated with $\Delta\nu_{it} = \nu_{it} - \nu_{i,t-1}$ as long as the $\nu_{it}$'s themselves are not serially correlated. This
instrumental variable (IV) estimation method leads to consistent but not necessarily efficient
estimates of the parameters in the model. This is because it does not make use of all the
available moment conditions, see Ahn and Schmidt (1995), and it does not take into account the
differenced structure on the residual disturbances ($\Delta\nu_{it}$). Arellano (1989) finds that for simple
dynamic error components models the estimator that uses differences $\Delta y_{i,t-2}$ rather than levels
$y_{i,t-2}$ for instruments has a singularity point and very large variances over a significant range of
parameter values. In contrast, the estimator that uses instruments in levels, i.e., $y_{i,t-2}$, has no
singularities and much smaller variances and is therefore recommended. Additional instruments
can be obtained in a dynamic panel data model if one utilizes the orthogonality conditions that
exist between lagged values of $y_{it}$ and the disturbances $\nu_{it}$.
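A minimal sketch of the levels-instrument version of the Anderson and Hsiao idea follows. To keep it short it assumes a pure autoregression with no $x_{it}$ regressors, so the just-identified IV estimator of $\delta$ is a ratio of two sums. The function name and data layout are hypothetical, and the toy panel is generated without idiosyncratic noise purely to check the algebra.

```python
def anderson_hsiao_levels(y):
    """IV estimate of delta in the first-differenced AR(1) panel model,
    using y_{i,t-2} in levels as the instrument for (y_{i,t-1} - y_{i,t-2}).

    y : list of N lists, y[i][0..T-1] holding periods 1..T
    """
    numerator = denominator = 0.0
    for yi in y:
        for t in range(2, len(yi)):
            instrument = yi[t - 2]
            numerator += instrument * (yi[t] - yi[t - 1])        # z * dy_t
            denominator += instrument * (yi[t - 1] - yi[t - 2])  # z * dy_{t-1}
    return numerator / denominator


# Noise-free toy panel from y_t = 0.5 * y_{t-1} + mu_i with mu = 1 and mu = 2
panel = [[1.0, 1.5, 1.75, 1.875],
         [0.0, 2.0, 3.0, 3.5]]
delta_hat = anderson_hsiao_levels(panel)
```

Differencing removes the individual effects, so the estimator recovers the autoregressive coefficient used to build the toy panel even though the two individuals have different $\mu_i$.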

Let  us  illustrate  this  with  the  simple  autoregressive  model  with  no  regressors: 

$$y_{it} = \delta y_{i,t-1} + u_{it}, \quad i = 1, \ldots, N;\ t = 1, \ldots, T \tag{12.60}$$

where $u_{it} = \mu_i + \nu_{it}$ with $\mu_i \sim$ IID$(0, \sigma_\mu^2)$ and $\nu_{it} \sim$ IID$(0, \sigma_\nu^2)$, independent of each other and
among themselves. In order to get a consistent estimate of $\delta$ as $N \to \infty$ with $T$ fixed, we first
difference (12.60) to eliminate the individual effects

$$y_{it} - y_{i,t-1} = \delta(y_{i,t-1} - y_{i,t-2}) + (\nu_{it} - \nu_{i,t-1}) \tag{12.61}$$

and note that $(\nu_{it} - \nu_{i,t-1})$ is MA(1) with unit root. For the first period we observe this
relationship, i.e., $t = 3$, we have

$$y_{i3} - y_{i2} = \delta(y_{i2} - y_{i1}) + (\nu_{i3} - \nu_{i2})$$

In this case, $y_{i1}$ is a valid instrument, since it is highly correlated with $(y_{i2} - y_{i1})$ and not
correlated with $(\nu_{i3} - \nu_{i2})$ as long as the $\nu_{it}$ are not serially correlated. But note what happens
for $t = 4$, the second period we observe (12.61):

$$y_{i4} - y_{i3} = \delta(y_{i3} - y_{i2}) + (\nu_{i4} - \nu_{i3})$$

In this case, $y_{i2}$ as well as $y_{i1}$ are valid instruments for $(y_{i3} - y_{i2})$, since both $y_{i2}$ and $y_{i1}$
are not correlated with $(\nu_{i4} - \nu_{i3})$. One can continue in this fashion, adding an extra valid
instrument with each forward period, so that for period $T$, the set of valid instruments becomes
$(y_{i1}, y_{i2}, \ldots, y_{i,T-2})$.

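This instrumental variable procedure still does not account for the differenced error term in
(12.61). In fact,

$$E(\Delta\nu_i \Delta\nu_i') = \sigma_\nu^2 G$$

where $\Delta\nu_i' = (\nu_{i3} - \nu_{i2}, \ldots, \nu_{iT} - \nu_{i,T-1})$ and

$$G = \begin{pmatrix} 2 & -1 & 0 & \cdots & 0 & 0 & 0 \\ -1 & 2 & -1 & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & -1 & 2 & -1 \\ 0 & 0 & 0 & \cdots & 0 & -1 & 2 \end{pmatrix} \tag{12.62}$$

is $(T-2) \times (T-2)$, since $\Delta\nu_i$ is MA(1) with unit root. Define

$$W_i = \begin{pmatrix} [y_{i1}] & & & 0 \\ & [y_{i1}, y_{i2}] & & \\ & & \ddots & \\ 0 & & & [y_{i1}, \ldots, y_{i,T-2}] \end{pmatrix} \tag{12.63}$$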


Then, the matrix of instruments is $W = [W_1', \ldots, W_N']'$ and the moment equations described
above are given by $E(W_i'\Delta\nu_i) = 0$. Premultiplying the differenced equation (12.61) in vector
form by $W'$, one gets

$$W'\Delta y = W'(\Delta y_{-1})\delta + W'\Delta\nu \tag{12.64}$$
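To make the structure of (12.62) and (12.63) concrete, the following is a minimal Python sketch, assuming NumPy is available, that builds $G$ and the block-diagonal $W_i$ for a single individual. The function names and the short example series are hypothetical and are used only for illustration.

```python
import numpy as np

def ab_instrument_matrix(y_i):
    """Build the block-diagonal instrument matrix W_i of (12.63).

    y_i holds (y_i1, ..., y_iT). Row r corresponds to the differenced
    equation for period t = r + 3 and contains (y_i1, ..., y_i,t-2).
    """
    T = len(y_i)
    n_rows = T - 2                       # differenced equations t = 3, ..., T
    n_cols = (T - 2) * (T - 1) // 2      # 1 + 2 + ... + (T - 2) instruments
    W = np.zeros((n_rows, n_cols))
    col = 0
    for r in range(n_rows):
        k = r + 1                        # number of valid lagged levels
        W[r, col:col + k] = y_i[:k]
        col += k
    return W

def first_difference_cov(T):
    """Tridiagonal G of (12.62): 2 on the diagonal, -1 next to it."""
    m = T - 2
    return 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)

y_example = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical series, T = 5
W_example = ab_instrument_matrix(y_example)
G_example = first_difference_cov(len(y_example))
```

With $T = 5$ there are three differenced equations and $1 + 2 + 3 = 6$ instrument columns, so each additional period adds one more valid instrument, as described above.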


Performing GLS on (12.64) one gets the Arellano and Bond (1991) preliminary one-step
consistent estimator

$$\hat\delta_1 = [(\Delta y_{-1})'W(W'(I_N \otimes G)W)^{-1}W'(\Delta y_{-1})]^{-1} \times [(\Delta y_{-1})'W(W'(I_N \otimes G)W)^{-1}W'(\Delta y)] \tag{12.65}$$

The optimal generalized method of moments (GMM) estimator of $\delta_1$ à la Hansen (1982) for
$N \to \infty$ and $T$ fixed using only the above moment restrictions yields the same expression as in
(12.65) except that

$$W'(I_N \otimes G)W = \sum_{i=1}^{N} W_i'GW_i$$

is replaced by

$$V_N = \sum_{i=1}^{N} W_i'(\Delta\nu_i)(\Delta\nu_i)'W_i$$

This GMM estimator requires no knowledge concerning the initial conditions or the distributions
of $\nu_i$ and $\mu_i$. To operationalize this estimator, $\Delta\nu$ is replaced by differenced residuals obtained
from the preliminary consistent estimator $\hat\delta_1$. The resulting estimator is the two-step Arellano
and Bond (1991) GMM estimator:

$$\hat\delta_2 = [(\Delta y_{-1})'W\hat V_N^{-1}W'(\Delta y_{-1})]^{-1}[(\Delta y_{-1})'W\hat V_N^{-1}W'(\Delta y)] \tag{12.66}$$

A consistent estimate of the asymptotic var$(\hat\delta_2)$ is given by the first term in (12.66),

$$\widehat{\text{var}}(\hat\delta_2) = [(\Delta y_{-1})'W\hat V_N^{-1}W'(\Delta y_{-1})]^{-1} \tag{12.67}$$

Note that $\hat\delta_1$ and $\hat\delta_2$ are asymptotically equivalent if the $\nu_{it}$ are IID$(0, \sigma_\nu^2)$.
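The one-step and two-step formulas in (12.65) to (12.67) can be sketched numerically as follows. This is an illustrative Python implementation under stated assumptions, not the routine used by any particular package: the data are simulated from (12.60) with a hypothetical $\delta = 0.5$, unit variances, and a burn-in period, and all function names are invented for this example.

```python
import numpy as np

def simulate_panel(N, T, delta, rng, burn=20):
    """Simulate y_it = delta * y_{i,t-1} + mu_i + nu_it with unit variances.

    The burn-in length and the N(0, 1) draws are assumptions of this sketch.
    """
    mu = rng.normal(size=N)
    y = np.zeros((N, T + burn))
    y[:, 0] = mu + rng.normal(size=N)
    for t in range(1, T + burn):
        y[:, t] = delta * y[:, t - 1] + mu + rng.normal(size=N)
    return y[:, burn:]                      # keep the last T periods

def arellano_bond(y):
    """Return (one-step, two-step, estimated variance of two-step) for delta."""
    N, T = y.shape
    m = T - 2                               # differenced equations per individual
    p = m * (m + 1) // 2                    # number of columns of W_i
    G = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)

    W_list, dy_list, dylag_list = [], [], []
    for i in range(N):
        W = np.zeros((m, p))
        col = 0
        for r in range(m):                  # row r is the equation for t = r + 3
            W[r, col:col + r + 1] = y[i, :r + 1]
            col += r + 1
        W_list.append(W)
        dy_list.append(y[i, 2:] - y[i, 1:-1])        # y_it - y_{i,t-1}
        dylag_list.append(y[i, 1:-1] - y[i, :-2])    # y_{i,t-1} - y_{i,t-2}

    Wdy = sum(W.T @ d for W, d in zip(W_list, dy_list))       # W' (Delta y)
    Wdl = sum(W.T @ d for W, d in zip(W_list, dylag_list))    # W' (Delta y_{-1})

    # One-step (12.65): weight matrix W'(I_N kron G)W = sum_i W_i' G W_i
    A1_inv = np.linalg.pinv(sum(W.T @ G @ W for W in W_list))
    delta1 = (Wdl @ A1_inv @ Wdy) / (Wdl @ A1_inv @ Wdl)

    # Two-step (12.66): weight matrix V_N built from one-step residuals
    V = np.zeros((p, p))
    for W, d, dl in zip(W_list, dy_list, dylag_list):
        g = W.T @ (d - delta1 * dl)
        V += np.outer(g, g)
    V_inv = np.linalg.pinv(V)
    denom = Wdl @ V_inv @ Wdl
    delta2 = (Wdl @ V_inv @ Wdy) / denom
    return delta1, delta2, 1.0 / denom      # last term is (12.67)

rng = np.random.default_rng(0)
y_sim = simulate_panel(N=2000, T=6, delta=0.5, rng=rng)
delta1, delta2, var2 = arellano_bond(y_sim)
```

Because the simulated $\nu_{it}$ are IID, the two estimates should be close to each other and to the value of $\delta$ used in the simulation, in line with the asymptotic equivalence noted above. The pseudo-inverse is used only as a numerical safeguard in case a weight matrix is close to singular.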

If there are additional strictly exogenous regressors $x_{it}$ as in (12.59) with $E(x_{it}\nu_{is}) = 0$ for
all $t, s = 1, 2, \ldots, T$, but where all the $x_{it}$ are correlated with $\mu_i$, then all the $x_{it}$ are valid
instruments for the first differenced equation of (12.59). Therefore, $[x_{i1}', x_{i2}', \ldots, x_{iT}']$ should be
added to each diagonal element of $W_i$ in (12.63). In this case, (12.64) becomes

$$W'\Delta y = W'(\Delta y_{-1})\delta + W'(\Delta X)\beta + W'\Delta\nu$$

where $\Delta X$ is the stacked $N(T-2) \times K$ matrix of observations on $\Delta x_{it}$. One- and two-step
estimators of $(\delta, \beta')$ can be obtained from

$$\begin{pmatrix} \hat\delta \\ \hat\beta \end{pmatrix} = ([\Delta y_{-1}, \Delta X]'W\hat V_N^{-1}W'[\Delta y_{-1}, \Delta X])^{-1}([\Delta y_{-1}, \Delta X]'W\hat V_N^{-1}W'\Delta y) \tag{12.68}$$

as in (12.65) and (12.66).

Arellano and Bond (1991) suggest Sargan's (1958) test of over-identifying restrictions given
by

$$m = \Delta\hat\nu'W\left[\sum_{i=1}^{N} W_i'(\Delta\hat\nu_i)(\Delta\hat\nu_i)'W_i\right]^{-1}W'(\Delta\hat\nu) \sim \chi^2_{p-K-1}$$

where $p$ refers to the number of columns of $W$ and $\Delta\hat\nu$ denotes the residuals from a two-step
estimation given in (12.68).

To  summarize,  dynamic  panel  data  estimation  of  equation  (12.59)  with  individual  fixed  effects 
suffers  from  the  Nickell  (1981)  bias.  This  disappears  only  if  T  tends  to  infinity.  Alternatively, 
a  GMM  estimator  was  suggested  by  Arellano  and  Bond  (1991)  which  basically  differences  the 
model  to  get  rid  of  the  individual  specific  effects  and  along  with  it  any  time  invariant  regressor. 
This  also  gets  rid  of  any  endogeneity  that  may  be  due  to  the  correlation  of  these  individual  effects 
and  the  right  hand  side  regressors.  The  moment  conditions  utilize  the  orthogonality  conditions 
between  the  differenced  errors  and  lagged  values  of  the  dependent  variable.  This  assumes  that 
the  original  disturbances  are  serially  uncorrelated.  In  fact,  two  diagnostics  are  computed  using 
the  Arellano  and  Bond  GMM  procedure  to  test  for  first  order  and  second  order  serial  correlation 
in  the  disturbances.  One  should  reject  the  null  of  the  absence  of  first  order  serial  correlation 
and not reject the absence of second order serial correlation. A special feature of dynamic panel
data GMM estimation is that the number of moment conditions increases with T. Therefore, a
Sargan  test  is  performed  to  test  the  over-identification  restrictions.  There  is  convincing  evidence 
that  too  many  moment  conditions  introduce  bias  while  increasing  efficiency.  It  is  even  suggested 
that  a  subset  of  these  moment  conditions  be  used  to  take  advantage  of  the  trade-off  between 
the  reduction  in  bias  and  the  loss  in  efficiency,  see  Baltagi  (2008)  for  details. 
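The Nickell (1981) bias mentioned in this summary can be illustrated with a small simulation. The following Python sketch assumes a hypothetical $\delta = 0.5$, unit-variance effects and disturbances, and no regressors other than the lagged dependent variable; it compares the Within estimator for a short panel with that for a long panel.

```python
import numpy as np

def simulate_panel(N, T, delta, rng, burn=50):
    """Simulate y_it = delta * y_{i,t-1} + mu_i + nu_it (assumed N(0, 1) draws)."""
    mu = rng.normal(size=N)
    y = np.zeros((N, T + burn))
    y[:, 0] = mu + rng.normal(size=N)
    for t in range(1, T + burn):
        y[:, t] = delta * y[:, t - 1] + mu + rng.normal(size=N)
    return y[:, burn:]

def within_estimate(y):
    """Within (fixed effects) estimator of delta from demeaned data."""
    y_cur, y_lag = y[:, 1:], y[:, :-1]
    y_cur = y_cur - y_cur.mean(axis=1, keepdims=True)   # sweep out mu_i
    y_lag = y_lag - y_lag.mean(axis=1, keepdims=True)
    return (y_lag * y_cur).sum() / (y_lag ** 2).sum()

rng = np.random.default_rng(0)
true_delta = 0.5
est_short = within_estimate(simulate_panel(1000, 5, true_delta, rng))    # T = 5
est_long = within_estimate(simulate_panel(1000, 50, true_delta, rng))    # T = 50
```

In this design the short-panel estimate falls well below the true value even though $N$ is large, while the long-panel estimate is much closer, consistent with the statement that the bias disappears only as $T$ tends to infinity.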

Arellano  and  Bond  (1991)  apply  their  GMM  estimation  and  testing  methods  to  a  model  of 
employment  using  a  panel  of  140  quoted  UK  companies  for  the  period  1979-84.  This  is  the 
benchmark  data  set  used  in  Stata  to  obtain  the  one-step  and  two-step  estimators  described 
in (12.65) and (12.66) as well as the Sargan test for over-identification using the command
(xtabond, twostep). The reader is asked to replicate their results in problem 22.


12.6.1  Empirical  Illustration 

Baltagi,  Griffin  and  Xiong  (2000)  estimate  a  dynamic  demand  model  for  cigarettes  based  on 
panel  data  from  46  American  states  over  30  years  1963-1992.  The  estimated  equation  is 

$$\ln C_{it} = \alpha + \beta_1 \ln C_{i,t-1} + \beta_2 \ln P_{it} + \beta_3 \ln Y_{it} + \beta_4 \ln Pn_{it} + u_{it} \tag{12.69}$$

where the subscript $i$ denotes the $i$th state ($i = 1, \ldots, 46$), and the subscript $t$ denotes the $t$th
year ($t = 1, \ldots, 30$). $C_{it}$ is real per capita sales of cigarettes by persons of smoking age (14
years and older). This is measured in packs of cigarettes per head. $P_{it}$ is the average retail price
of a pack of cigarettes measured in real terms. $Y_{it}$ is real per capita disposable income. $Pn_{it}$
denotes the minimum real price of cigarettes in any neighboring state. This last variable is a
proxy for the casual smuggling effect across state borders. It acts as a substitute price attracting
consumers from high-tax states like Massachusetts to cross over to New Hampshire where the
tax is low. The disturbance term is specified as a two-way error component model:

$$u_{it} = \mu_i + \lambda_t + \nu_{it} \qquad i = 1, \ldots, 46;\ t = 1, \ldots, 30 \tag{12.70}$$

where $\mu_i$ denotes a state-specific effect, and $\lambda_t$ denotes a year-specific effect. The time-period
effects (the $\lambda_t$) are assumed fixed parameters to be estimated as coefficients of time dummies
for each year in the sample. This can be justified given the numerous policy interventions as
well as health warnings and Surgeon General's reports. For example:

(1)  the  imposition  of  warning  labels  by  the  Federal  Trade  Commission  effective  January  1965; 

(2)  the  application  of  the  Fairness  Doctrine  Act  to  cigarette  advertising  in  June  1967,  which 
subsidized  antismoking  messages  from  1968  to  1970; 

(3)  the  Congressional  ban  on  broadcast  advertising  of  cigarettes  effective  January  1971. 

The $\mu_i$ are state-specific effects which can represent any state-specific characteristic including
the following:

(1)  States  with  Indian  reservations  like  Montana,  New  Mexico  and  Arizona  are  among  the 
biggest  losers  in  tax  revenues  from  non-Indians  purchasing  tax-exempt  cigarettes  from  the 
reservations. 

(2)  Florida,  Texas,  Washington  and  Georgia  are  among  the  biggest  losers  of  revenues  due  to 
the  purchasing  of  cigarettes  from  tax-exempt  military  bases  in  these  states. 

(3)  Utah,  which  has  a  high  percentage  of  Mormon  population  (a  religion  which  forbids  smok¬ 
ing),  has  a  per  capita  sales  of  cigarettes  in  1988  of  55  packs,  a  little  less  than  half  the 
national  average  of  113  packs. 

(4)  Nevada,  which  is  a  highly  touristic  state,  has  a  per  capita  sales  of  cigarettes  of  142  packs 
in  1988,  29  more  packs  than  the  national  average. 

These  state-specific  effects  may  be  assumed  fixed,  in  which  case  one  includes  state  dummy 
variables  in  equation  (12.69).  The  resulting  estimator  is  the  Within  estimator  reported  in  Table 
12.8.  Comparing  these  estimates  with  OLS  without  state  or  time  dummies,  one  can  see  that 
the coefficient of lagged consumption drops from 0.97 to 0.83 and the price elasticity goes up
in absolute value from -0.09 to -0.30. The income elasticity switches sign from negative to
positive going from -0.03 to 0.10.

The OLS and Within estimators do not take into account the endogeneity of the lagged
dependent variable, and therefore 2SLS and Within-2SLS are performed. The instruments used
are one lag on price, neighboring price and income. These give lower estimates of lagged
consumption and higher own price elasticities in absolute value. The Arellano and Bond (1991)
two-step estimator yields an estimate of lagged consumption of 0.70 and a price elasticity of
-0.40, both of which are significant. Sargan's test for over-identification yields an observed
value of 32.3. This is asymptotically distributed as $\chi^2_{27}$ and is not significant. This was
obtained using the Stata command (xtabond2, twostep) with the collapse option to reduce the
number of moment conditions used for estimation.

Table 12.8  Estimates of the Dynamic Demand for Cigarettes in (12.69)*

                                ln C_{i,t-1}    ln P_{it}    ln Y_{it}    ln Pn_{it}
OLS                             0.97            -0.090       -0.03        0.024
                                (157.7)         (6.2)        (5.1)        (1.8)
Within                          0.83            -0.299       0.10         0.034
                                (66.3)          (12.7)       (4.2)        (1.2)
2SLS                            0.85            -0.205       -0.02        0.052
                                (25.3)          (5.8)        (2.2)        (3.1)
Within-2SLS                     0.60            -0.496       0.19         -0.016
                                (17.0)          (13.0)       (6.4)        (0.5)
Arellano and Bond (two-step)    0.70            -0.396       0.13         -0.003
                                (10.2)          (6.0)        (3.5)        (0.1)

* Numbers in parentheses are t-statistics.

Source: Some of the results in this Table are reported in Baltagi, Griffin and Xiong (2000).


12.7  Program Evaluation and Difference-in-Differences Estimator

Suppose  we  want  to  study  the  effect  of  job  training  programs  on  earnings.  An  ideal  experiment 
would  assign  individuals  randomly,  by  a  flip  of  a  coin,  to  training  and  non-training  camps,  and 
then  compare  their  earnings,  holding  other  factors  constant.  This  is  a  necessary  experiment 
before  the  approval  of  any  drug.  Patients  are  randomly  assigned  to  receive  the  drug  or  a 
placebo  and  the  drug  is  approved  or  disapproved  depending  on  the  difference  in  the  outcome 
between  these  two  groups.  In  this  case,  the  FDA  is  concerned  with  the  drug’s  safety  and  its 
effectiveness. However, we run into problems in setting up this experiment. How can we hold other
factors constant? Even twins, who have been used in economic studies, are not identical and
may have different life experiences.

The  individual’s  prior  work  experience  will  affect  one’s  chances  in  getting  a  job  after  training. 
But  as  long  as  the  individuals  are  randomly  assigned,  the  distribution  of  work  experience  is  the 
same  in  the  treatment  and  control  group,  i.e. ,  participation  in  the  job  training  is  independent 
of  prior  work  experience.  In  this  case,  omitting  previous  work  experience  from  the  analysis  will 
not  cause  omitted  variable  bias  in  the  estimator  of  the  effect  of  the  training  program  on  future 
employment.  Stock  and  Watson  (2003)  discuss  threats  to  the  internal  and  external  validity  of 
such  experiments.  The  former  include:  (i)  failure  to  randomize,  or  (ii)  to  follow  the  treatment 
protocol.  These  failures  can  cause  bias  in  estimating  the  effect  of  the  treatment.  The  first 
can  happen  when  individuals  are  assigned  non-randomly  to  the  treatment  and  non-treatment 
groups.  The  second  can  happen,  for  example,  when  some  people  in  the  training  program  do 
not  show  up  for  all  training  sessions;  or  when  some  people  who  are  not  supposed  to  be  in  the 
training  program  are  allowed  to  attend  some  of  these  training  sessions.  Attrition  caused  by 
people  dropping  out  of  the  experiment  in  either  group  can  cause  bias  especially  if  the  cause 
of  attrition  is  related  to  their  acquiring  or  not  acquiring  training.  In  addition,  small  samples, 
usually  associated  with  expensive  experiments,  can  affect  the  precision  of  the  estimates.  There 
can  also  be  experimental  effects,  brought  about  by  people  trying  harder  simply  because  the 
worker  being  trained  feels  noticed  or  because  the  trainer  has  a  stake  in  the  success  of  the 
program.  Stock  and  Watson  (2003,  p.  380)  argue  that  “threats  to  external  validity  compromise 
the ability to generalize the results of the experiment to other populations and settings. Two
such threats are when the experimental sample is not representative of the population of interest
and  when  the  treatment  being  studied  is  not  representative  of  the  treatment  that  would  be 
implemented  more  broadly.” 

They  also  warn  about  “general  equilibrium  effects”  where,  for  example,  turning  a  small,  tem¬ 
porary  experimental  program  into  a  widespread,  permanent  program  might  change  the  economic 
environment  sufficiently  that  the  results  of  the  experiment  cannot  be  generalized.  For  example, 
it  could  displace  employer-provided  training,  thereby  reducing  the  net  benefits  of  the  program. 

12.7.1  The  Difference-in-Differences  Estimator 

With  panel  data,  observations  on  the  same  subjects  before  and  after  the  training  program  allow 
us  to  estimate  the  effect  of  this  program  on  earnings.  In  simple  regression  form,  assuming 
the  assignment  to  the  training  program  is  random,  one  regresses  the  change  in  earnings  before 
and  after  training  is  completed  on  a  dummy  variable  which  takes  the  value  1  if  the  individual 
received training and zero if they did not. This regression computes the average change in
earnings for the treatment group before and after the training program and subtracts from it
the average change in earnings for the control group. One can include additional regressors
which  measure  the  individual  characteristics  prior  to  training.  Examples  are  gender,  race, 
education  and  age  of  the  individual. 
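The regression just described can be sketched in a few lines of Python. The example below is purely illustrative: the earnings data are simulated, and the treatment effect of 2.0, the common trend of 1.0, and all variable names are hypothetical choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
treated = rng.integers(0, 2, size=n)        # random assignment: 1 = trained
true_effect = 2.0                           # hypothetical training effect
common_trend = 1.0                          # change shared by both groups

ability = rng.normal(size=n)                # individual effect, drops out of the change
earn_before = 10 + ability + rng.normal(size=n)
earn_after = (10 + ability + common_trend
              + true_effect * treated + rng.normal(size=n))
change = earn_after - earn_before

# Regress the change in earnings on a constant and the training dummy
X = np.column_stack([np.ones(n), treated])
coef, *_ = np.linalg.lstsq(X, change, rcond=None)
did_regression = coef[1]

# Same number computed directly as a difference of two average changes
did_means = change[treated == 1].mean() - change[treated == 0].mean()
```

The slope on the dummy equals the average change for the treatment group minus the average change for the control group, so the two computations agree up to rounding error, and the individual effect cancels when earnings are differenced.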

Card (1990) used a quasi-experiment to see whether immigration reduces wages. Taking
advantage of the “Mariel boatlift,” in which a large number of Cuban immigrants entered Miami,
Card (1990) used the difference-in-differences estimator, comparing the change in wages of
low-skilled workers in Miami to the change in wages of similar workers in other comparable U.S.
cities  over  the  same  period.  Card  concluded  that  the  influx  of  Cuban  immigrants  had  a  negligible 
effect  on  wages  of  less-skilled  workers. 


Problems 

1. Fixed Effects and the Within Transformation.

(a) Premultiply (12.11) by $Q$ and verify that the transformed equation reduces to (12.12). Show
that the new disturbances $Q\nu$ have zero mean and variance-covariance matrix $\sigma_\nu^2 Q$.

Hint: $QZ_\mu = 0$.

(b) Show that the GLS estimator is the same as the OLS estimator on this transformed regression
equation. Hint: Use one of the necessary and sufficient conditions for GLS to be equivalent
to OLS given in Chapter 9.

(c) Using the Frisch-Waugh-Lovell Theorem given in Chapter 7, show that the estimator derived
in part (b) is the Within estimator and is given by $\tilde\beta = (X'QX)^{-1}X'Qy$.

2. Variance-Covariance Matrix of Random Effects.

(a) Show that $\Omega$ given in (12.17) can be written as (12.18).

(b) Show that $P$ and $Q$ are symmetric, idempotent, orthogonal and sum to the identity matrix.

(c) For $\Omega^{-1}$ given by (12.19), verify that $\Omega\Omega^{-1} = \Omega^{-1}\Omega = I_{NT}$.

(d) For $\Omega^{-1/2}$ given by (12.20), verify that $\Omega^{-1/2}\Omega^{-1/2} = \Omega^{-1}$.

3. Fuller and Battese (1974) Transformation. Premultiply $y$ by $\sigma_\nu\Omega^{-1/2}$ where $\Omega^{-1/2}$ is defined in
(12.20) and show that the resulting $y^*$ has a typical element $y_{it}^* = y_{it} - \theta\bar y_{i.}$, where $\theta = 1 - \sigma_\nu/\sigma_1$
and $\sigma_1^2 = T\sigma_\mu^2 + \sigma_\nu^2$.

4. Unbiased Estimates of the Variance-Components. Using (12.21) and (12.22), show that $E(\hat\sigma_1^2) = \sigma_1^2$
and $E(\hat\sigma_\nu^2) = \sigma_\nu^2$. Hint: $E(u'Qu) = E\{\text{tr}(u'Qu)\} = E\{\text{tr}(uu'Q)\} = \text{tr}\{E(uu')Q\} = \text{tr}(\Omega Q)$.

5. Swamy and Arora (1972) Estimates of the Variance-Components.

(a) Show that $\hat{\hat\sigma}_\nu^2$ given in (12.23) is unbiased for $\sigma_\nu^2$.

(b) Show that $\hat{\hat\sigma}_1^2$ given in (12.26) is unbiased for $\sigma_1^2$.

6. System Estimation.

(a) Perform OLS on the system of equations given in (12.27) and show that the resulting
estimator is $\hat\delta_{OLS} = (Z'Z)^{-1}Z'y$.

(b) Perform GLS on this system of equations and show that the resulting estimator is $\hat\delta_{GLS} =
(Z'\Omega^{-1}Z)^{-1}Z'\Omega^{-1}y$ where $\Omega^{-1}$ is given in (12.19).

7. Random Effects Is More Efficient than Fixed Effects. Using the var$(\hat\beta_{GLS})$ expression below (12.30)
and var$(\tilde\beta_{Within}) = \sigma_\nu^2 W_{XX}^{-1}$, show that

$$(\text{var}(\hat\beta_{GLS}))^{-1} - (\text{var}(\tilde\beta_{Within}))^{-1} = \phi^2 B_{XX}/\sigma_\nu^2$$

which is positive semi-definite. Conclude that var$(\tilde\beta_{Within}) -$ var$(\hat\beta_{GLS})$ is positive semi-definite.

8. Maximum Likelihood Estimation of the Random Effects Model.

(a) Using the concentrated likelihood function in (12.34), solve $\partial L_c/\partial\phi^2 = 0$ and verify (12.35).

(b) Solve $\partial L_c/\partial\beta = 0$ and verify (12.36).

9. Prediction in the Random Effects Model.

(a) For the predictor of $y_{i,T+S}$ given in (12.37), compute $E(u_{i,T+S}u_{it})$ for $t = 1, 2, \ldots, T$ and
verify that $w = E(u_{i,T+S}u) = \sigma_\mu^2(\ell_i \otimes \iota_T)$ where $\ell_i$ is the $i$-th column of $I_N$.

(b) Verify (12.39) by showing that $(\ell_i' \otimes \iota_T')P = (\ell_i' \otimes \iota_T')$.

10.  Using  the  gasoline  demand  data  of  Baltagi  and  Griffin  (1983),  given  on  the  Springer  web  site  as 
GASOLINE.DAT,  reproduce  Tables  12.1  through  12.7. 

11. Bounds on $s^2$ in the Random Effects Model. For the random one-way error components model
given in (12.1) and (12.2), consider the OLS estimator of var$(u_{it}) = \sigma^2$, which is given by $s^2 =
e'e/(n - K')$, where $e$ denotes the vector of OLS residuals, $n = NT$ and $K' = K + 1$.

(a) Show that $E(s^2) = \sigma^2 + \sigma_\mu^2[K' - \text{tr}(I_N \otimes J_T)P_X]/(n - K')$.

(b) Consider the inequalities given by Kiviet and Krämer (1992) which state that $0 \le$ mean of
$n - K'$ smallest roots of $\Omega \le E(s^2) \le$ mean of $n - K'$ largest roots of $\Omega \le \text{tr}(\Omega)/(n - K')$
where $\Omega = E(uu')$. Show that for the one-way error components model, these bounds are

$$0 \le \sigma_\nu^2 + (n - TK')\sigma_\mu^2/(n - K') \le E(s^2) \le \sigma_\nu^2 + n\sigma_\mu^2/(n - K') \le n\sigma^2/(n - K').$$

As $n \to \infty$, both bounds tend to $\sigma^2$, and $s^2$ is asymptotically unbiased, irrespective of the
particular evolution of $X$, see Baltagi and Krämer (1994) for a proof of this result.

12. Verify the relationship between $M$ and $M^*$, i.e., $MM^* = M^*$, given below (12.47). Hint: Use
the fact that $Z = Z^*I^*$ with $I^* = (\iota_N \otimes I_{K'})$.

13.  Verify  that  M  and  M*  defined  below  (12.50)  are  both  symmetric,  idempotent  and  satisfy  MM*  = 
M*. 

14.  For  the  gasoline  data  used  in  problem  10,  verify  the  Chow-test  results  given  below  equation  (12.51). 

15. For the gasoline data, compute the Breusch-Pagan, Honda and Standardized LM tests for $H_0$; $\sigma_\mu^2 = 0$.


16. If $\tilde\beta$ denotes the LSDV estimator and $\hat\beta_{GLS}$ denotes the GLS estimator, then

(a) Show that $\hat q = \hat\beta_{GLS} - \tilde\beta$ satisfies $\text{cov}(\hat q, \hat\beta_{GLS}) = 0$.

(b)  Verify  equation  (12.56). 

17.  For  the  gasoline  data  used  in  problem  10,  replicate  the  Hausman  test  results  given  below  equation 
(12.58). 

18.  For  the  cigarette  data  given  as  CIGAR.TXT  on  the  Springer  web  site,  reproduce  the  results  given 
in  Table  12.8.  See  also  Baltagi,  Griffin  and  Xiong  (2000). 

19. Heteroskedastic Fixed Effects Models. This is based on Baltagi (1996). Consider the fixed effects
model

$$y_{it} = \alpha_i + u_{it} \qquad i = 1, 2, \ldots, N;\ t = 1, 2, \ldots, T_i$$

where $y_{it}$ denotes output in industry $i$ at time $t$ and $\alpha_i$ denotes the industry fixed effect. The
disturbances $u_{it}$ are assumed to be independent with heteroskedastic variances $\sigma_i^2$. Note that the
data are unbalanced with different number of observations for each industry.

(a) Show that OLS and GLS estimates of $\alpha_i$ are identical.

(b) Let $\sigma^2 = \sum_{i=1}^{N} T_i\sigma_i^2/n$ where $n = \sum_{i=1}^{N} T_i$, be the average disturbance variance. Show that
the GLS estimator of $\sigma^2$ is unbiased, whereas the OLS estimator of $\sigma^2$ is biased. Also show
that this bias disappears if the data are balanced or the variances are homoskedastic.

(c) Define $\lambda_i^2 = \sigma_i^2/\sigma^2$ for $i = 1, 2, \ldots, N$. Show that for $\alpha' = (\alpha_1, \alpha_2, \ldots, \alpha_N)$

$$E[\text{estimated var}(\hat\alpha_{OLS}) - \text{true var}(\hat\alpha_{OLS})] = \sigma^2\left[\left(n - \sum_{i=1}^{N}\lambda_i^2\right)\Big/(n - N)\right]\text{diag}(1/T_i) - \sigma^2\,\text{diag}(\lambda_i^2/T_i)$$

This problem shows that in case there are no regressors in the unbalanced panel data model,
fixed effects with heteroskedastic disturbances can be estimated by OLS, but one has to
correct the standard errors.

20. The Relative Efficiency of the Between Estimator with Respect to the Within Estimator. This is
based on Baltagi (1999). Consider the simple panel data regression model

$$y_{it} = \alpha + \beta x_{it} + u_{it} \qquad i = 1, 2, \ldots, N;\ t = 1, 2, \ldots, T \tag{1}$$

where $\alpha$ and $\beta$ are scalars. Subtract the mean equation to get rid of the constant

$$y_{it} - \bar y_{..} = \beta(x_{it} - \bar x_{..}) + u_{it} - \bar u_{..} \tag{2}$$

where $\bar x_{..} = \sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}/NT$ and $\bar y_{..}$ and $\bar u_{..}$ are similarly defined. Add and subtract $\bar x_{i.}$ from the
regressor in parentheses and rearrange

$$y_{it} - \bar y_{..} = \beta(x_{it} - \bar x_{i.}) + \beta(\bar x_{i.} - \bar x_{..}) + u_{it} - \bar u_{..} \tag{3}$$

where $\bar x_{i.} = \sum_{t=1}^{T} x_{it}/T$. Now run the unrestricted least squares regression

$$y_{it} - \bar y_{..} = \beta_w(x_{it} - \bar x_{i.}) + \beta_b(\bar x_{i.} - \bar x_{..}) + u_{it} - \bar u_{..} \tag{4}$$

where $\beta_w$ is not necessarily equal to $\beta_b$.

(a) Show that the least squares estimator of $\beta_w$ from (4) is the Within estimator and that of $\beta_b$
is the Between estimator.

(b) Show that if $u_{it} = \mu_i + \nu_{it}$ where $\mu_i \sim \text{IID}(0, \sigma_\mu^2)$ and $\nu_{it} \sim \text{IID}(0, \sigma_\nu^2)$ independent of each
other and among themselves, then ordinary least squares (OLS) is equivalent to generalized
least squares (GLS) on (4).

(c) Show that for model (1), the relative efficiency of the Between estimator with respect to
the Within estimator is equal to $(B_{XX}/W_{XX})[(1 - \rho)/(T\rho + (1 - \rho))]$, where $W_{XX} =
\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar x_{i.})^2$ denotes the Within variation and $B_{XX} = T\sum_{i=1}^{N}(\bar x_{i.} - \bar x_{..})^2$ denotes
the Between variation. Also, $\rho = \sigma_\mu^2/(\sigma_\mu^2 + \sigma_\nu^2)$ denotes the equicorrelation coefficient.

(d) Show that the square of the t-statistic used to test $H_0$; $\beta_w = \beta_b$ in (4) yields exactly
Hausman's (1978) specification test.

21. For the crime example of Cornwell and Trumbull (1994) studied in Chapter 11, use the panel data
given as CRIME.DAT on the Springer web site to replicate the Between and Within estimates
given  in  Table  1  of  Cornwell  and  Trumbull  (1994).  Compute  2SLS  and  Within-2SLS  (2SLS  with 
county  dummies)  using  offense  mix  and  per  capita  tax  revenue  as  instruments  for  the  probability 
of  arrest  and  police  per  capita.  Comment  on  the  results. 

22.  Consider  the  Arellano  and  Bond  (1991)  dynamic  employment  equation  for  140  UK  companies  over 
the  period  1979-1984.  Replicate  all  the  estimation  results  in  Table  4  of  Arellano  and  Bond  (1991, 
p. 290).

CHAPTER 13

Limited  Dependent  Variables 

13.1  Introduction 

In  labor  economics,  one  is  faced  with  explaining  the  decision  to  participate  in  the  labor  force, 
the  decision  to  join  a  union,  or  the  decision  to  migrate  from  one  region  to  the  other.  In  finance, 
a  consumer  defaults  on  a  loan  or  a  credit  card  debt,  or  purchases  a  stock  or  an  asset  like  a  house 
or  a  car.  In  these  examples,  the  dependent  variable  is  usually  a  dummy  variable  with  values  1  if 
the  worker  participates  (or  consumer  defaults  on  a  loan)  and  0  if  he  or  she  does  not  participate 
(or  default).  We  dealt  with  dummy  variables  as  explanatory  variables  on  the  right  hand  side  of 
the  regression,  but  what  additional  problems  arise  when  this  dummy  variable  appears  on  the 
left  hand  side  of  the  equation?  As  we  have  done  in  previous  chapters,  we  first  study  its  effects 
on  the  usual  least  squares  estimator,  and  then  consider  alternative  estimators  that  are  more 
appropriate  for  models  of  this  nature. 


13.2  The  Linear  Probability  Model 

What  is  wrong  with  running  OLS  on  this  model?  After  all,  it  is  a  feasible  procedure.  For  the 
labor  force  participation  example  one  regresses  the  dummy  variable  for  participation  on  age, 
sex,  race,  marital  status,  number  of  children,  experience  and  education,  etc.  The  prediction 
from  this  OLS  regression  is  interpreted  as  the  likelihood  of  participating  in  the  labor  force.  The 
problems  with  this  interpretation  are  the  following: 

(i)  We  are  predicting  probabilities  of  participation  for  each  individual,  whereas  the  actual  values 
observed  are  0  or  1. 

(ii) There is no guarantee that $\hat y_i$, the predicted value of $y_i$, is going to be between 0 and 1. In
fact, one can always find values of the explanatory variables that would generate a corresponding
prediction outside the (0, 1) range.

(iii)  Even  if  one  is  willing  to  assume  that  the  true  model  is  a  linear  regression  given  by 

$$y_i = x_i'\beta + u_i \qquad i = 1, 2, \ldots, n \tag{13.1}$$

what properties does this entail on the disturbances? It is obvious that $y_i = 1$ only when
$u_i = 1 - x_i'\beta$, let us say with probability $\pi_i$, where $\pi_i$ is to be determined. Then $y_i = 0$ only
when $u_i = -x_i'\beta$ with probability $(1 - \pi_i)$. For the disturbances to have zero mean

$$E(u_i) = \pi_i(1 - x_i'\beta) + (1 - \pi_i)(-x_i'\beta) = 0 \tag{13.2}$$

Solving for $\pi_i$, one gets that $\pi_i = x_i'\beta$. This also means that

$$\text{var}(u_i) = \pi_i(1 - \pi_i) = x_i'\beta(1 - x_i'\beta) \tag{13.3}$$

which is heteroskedastic. Goldberger (1964) suggests correcting for this heteroskedasticity by
first running OLS to estimate $\beta$, and estimating $\sigma_i^2 = \text{var}(u_i)$ by $\hat\sigma_i^2 = x_i'\hat\beta_{OLS}(1 - x_i'\hat\beta_{OLS}) =
\hat y_i(1 - \hat y_i)$. In the next step a Weighted Least Squares (WLS) procedure is run on (13.1) with
the original observations divided by $\hat\sigma_i$. One cannot compute $\hat\sigma_i$ if OLS predicts $\hat y_i$ larger than
1 or smaller than 0. Suggestions in the literature include substituting 0.005 instead of $\hat y_i < 0$,
and 0.995 for $\hat y_i > 1$. However, these procedures do not perform well, and the WLS predictions
themselves are not guaranteed to fall in the (0, 1) range. Therefore, one should use the robust
White heteroskedastic variance-covariance matrix option when estimating linear probability
models, otherwise the standard errors are biased and inference is misleading.
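The two practical issues raised here, predictions outside the (0, 1) range and heteroskedastic disturbances, can be seen in a small simulated example. The Python sketch below is illustrative only: the logistic data generating process, its coefficients, and the variable names are assumptions made for this example, and the robust covariance is the basic White (HC0) form without small-sample corrections.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))   # hypothetical true probabilities
y = (rng.uniform(size=n) < p_true).astype(float)  # observed 0/1 outcomes

# Linear probability model estimated by OLS
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y
y_hat = X @ beta_ols
resid = y - y_hat

# Share of fitted "probabilities" that fall outside the unit interval
share_outside = np.mean((y_hat < 0) | (y_hat > 1))

# Conventional OLS standard errors (assume homoskedasticity)
s2 = resid @ resid / (n - X.shape[1])
se_conventional = np.sqrt(np.diag(s2 * XtX_inv))

# White heteroskedasticity-robust standard errors (HC0 sandwich form)
meat = X.T @ (X * resid[:, None] ** 2)
se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
```

In this simulated sample some fitted values lie outside the unit interval, so $\hat y_i(1 - \hat y_i)$ is negative for those observations and the WLS weights cannot be formed without an ad hoc adjustment. The robust standard errors avoid that step.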

This  brings  us  to  the  fundamental  problem  with  OLS,  i.e.,  its  functional  form.  We  are  trying 
to  predict 

$$y_i = F(x_i'\beta) + u_i \tag{13.4}$$

with a linear regression equation, see Figure 13.1, where the more reasonable functional form
for this probability is an S-shaped cumulative distribution functional form. This was justified
in the biometrics literature as follows: An insect has a tolerance to an insecticide $I_i^*$, which is
an unobserved random variable with cumulative distribution function (c.d.f.) $F$. If the dosage
of insecticide administered induces a stimulus $I_i$ that exceeds $I_i^*$, the insect dies, i.e., $y_i = 1$.
Therefore

$$\Pr(y_i = 1) = \Pr(I_i^* \le I_i) = F(I_i) \tag{13.5}$$

To put it in an economic context, $I_i^*$ could be the unobserved reservation wage of a worker, and if
we increase the offered wage beyond that reservation wage, the worker participates in the labor
force. In general, $I_i$ could be represented as a function of the individual's characteristics, i.e., the
$x_i$'s. $F(x_i'\beta)$ is by definition between zero and 1 for all values of $x_i$. Also, the linear probability
model yields the result that $\partial\pi_i/\partial x_k = \beta_k$, for every $i$. This means that the probability of
participating ($\pi_i$) always changes at the same rate with respect to unit increases in the offer
wage $x_k$. However, this probability model gives

$$\partial\pi_i/\partial x_k = [\partial F(z_i)/\partial z_i] \cdot [\partial z_i/\partial x_k] = f(x_i'\beta) \cdot \beta_k \tag{13.6}$$

where $z_i = x_i'\beta$, and $f$ is the probability density function (p.d.f.). Equation (13.6) makes more
sense because if $x_k$ denotes the offered wage, changing the probability of participation $\pi_i$ from
0.96 to 0.97 requires a larger change in $x_k$ than changing $\pi_i$ from 0.23 to 0.24.
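The comparison in the last sentence can be checked numerically. The following Python sketch assumes, for illustration only, that $F$ is the standard normal c.d.f. and that the coefficient on the offered wage is $\beta_k = 1$; both choices, and the function names, are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()      # assumed F: the standard normal c.d.f.
beta_k = 1.0                   # hypothetical coefficient on the offered wage

def required_change(p_from, p_to, beta):
    """Change in x_k needed to move F(x'beta) from p_from to p_to."""
    return (std_normal.inv_cdf(p_to) - std_normal.inv_cdf(p_from)) / beta

def marginal_effect(p, beta):
    """f(x'beta) * beta_k of (13.6), evaluated where F(x'beta) = p."""
    return std_normal.pdf(std_normal.inv_cdf(p)) * beta

dx_tail = required_change(0.96, 0.97, beta_k)      # move in the upper tail
dx_middle = required_change(0.23, 0.24, beta_k)    # move nearer the center

me_tail = marginal_effect(0.96, beta_k)
me_middle = marginal_effect(0.23, beta_k)
```

Under these assumptions the one-point increase in probability requires a larger change in $x_k$ in the tail than near the center, because the density $f$, and hence the marginal effect, is smaller in the tail.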

If  F(x'iP)  is  the  true  probability  function,  assuming  it  is  linear  introduces  misspecification,  and 
as Figure 13.1 indicates, for $x_i < x_\ell$, all the $u_i$'s generated by a linear probability approximation
are positive. Similarly for all $x_i > x_u$, all the $u_i$'s generated by a linear probability approximation
are negative.


13.3  Functional  Form:  Logit  and  Probit 

Having  pointed  out  the  problems  with  considering  the  functional  form  F  as  linear,  we  turn  to 
two popular functional forms of $F$, the logit and the probit. These two c.d.f.'s differ only in the
tails, and the logit resembles the c.d.f. of a t-distribution with 7 degrees of freedom, whereas
the probit is the normal c.d.f., or that of a t with $\infty$ degrees of freedom. Therefore, these two
forms will give similar predictions unless there are an extreme number of observations in the
tails.




Figure  13.1  Linear  Probability  Model 


We will use the conventional notation $\Phi(z) = \int_{-\infty}^{z} \phi(u)\,du$, where $\phi(z) = e^{-z^2/2}/\sqrt{2\pi}$ for
$-\infty < z < +\infty$, for the probit. Also, $\Lambda(z) = e^z/(1 + e^z) = 1/(1 + e^{-z})$ for $-\infty < z < +\infty$,
for the logit. Some results that we will use quite often in our derivations are the following:
$d\Phi/dz = \phi$, and $d\Lambda/dz = \Lambda(1 - \Lambda)$. The p.d.f. of the logistic distribution is the product of its
c.d.f. and one minus this c.d.f. Therefore, the marginal effects considered above for a general F
are respectively,

$\partial\Phi(x_i'\beta)/\partial x_k = \phi_i\beta_k$ (13.7)

and

$\partial\Lambda(x_i'\beta)/\partial x_k = \Lambda_i(1 - \Lambda_i)\beta_k$ (13.8)

where $\phi_i = \phi(x_i'\beta)$ and $\Lambda_i = \Lambda(x_i'\beta)$.

One  has  to  be  careful  with  the  computation  of  partial  derivatives  in  case  there  is  a  dummy 
variable  among  the  explanatory  variables.  For  such  models,  one  should  compute  the  marginal 
effects  of  a  change  in  one  unit  of  a  continuous  variable  xk  for  both  values  of  the  dummy  variable. 


Illustrative Example: Using the probit model, suppose that the probability of joining a union
is estimated as follows: $\hat\pi_i = \Phi(2.5 - 0.06\,WKS_i + 0.95\,OCC_i)$ where WKS is the number of
weeks worked and OCC = 1, if the individual is in a blue-collar occupation, and zero otherwise.
Weeks worked in this sample range from 20 to 50. From (13.7), the marginal effect of one extra
week of work on the probability of joining the union is given by:

For Blue-Collar Workers: $-0.06\,\phi[2.5 - 0.06\,WKS + 0.95]$

  = $-0.06\,\phi[2.25]$ = -0.002 at WKS = 20
  = $-0.06\,\phi[1.35]$ = -0.010 at WKS = 35
  = $-0.06\,\phi[0.45]$ = -0.022 at WKS = 50

For Non Blue-Collar Workers: $-0.06\,\phi[2.5 - 0.06\,WKS]$

  = $-0.06\,\phi[1.3]$ = -0.010 at WKS = 20
  = $-0.06\,\phi[0.4]$ = -0.022 at WKS = 35
  = $-0.06\,\phi[-0.5]$ = -0.021 at WKS = 50

Note  how  different  these  marginal  effects  are  for  blue-collar  versus  non  blue-collar  workers  even 
for  the  same  weeks  worked.  Increasing  weeks  worked  from  20  to  21  reduces  the  probability  of 
joining  the  union  by  0.002  for  a  Blue-Collar  worker.  This  is  compared  to  five  times  that  amount 
for  a  Non  Blue-Collar  worker. 
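
The six marginal effects in this example can be reproduced with a few lines of Python. This is only a sketch: the coefficients 2.5, -0.06 and 0.95 are those of the illustrative example, while the function name `marginal_effect_wks` is introduced here for convenience.

```python
from statistics import NormalDist

phi = NormalDist().pdf  # standard normal density

def marginal_effect_wks(wks, occ):
    """Marginal effect of one extra week from (13.7):
    -0.06 * phi(2.5 - 0.06*WKS + 0.95*OCC)."""
    index = 2.5 - 0.06 * wks + 0.95 * occ
    return -0.06 * phi(index)

# occ = 1 for blue-collar workers, 0 otherwise
effects = {(occ, wks): marginal_effect_wks(wks, occ)
           for occ in (1, 0) for wks in (20, 35, 50)}

for (occ, wks), me in effects.items():
    group = "blue-collar" if occ == 1 else "non blue-collar"
    print(f"{group:16s} WKS = {wks}: {me:.3f}")
```

Rounded to three decimals, the printed values agree with the figures listed in the example.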


13.4  Grouped  Data 

In the biometrics literature, grouped data is very likely from laboratory experiments, see Cox
(1970). In the insecticide example, every dosage level $x_i$ is administered to a group of insects of
size $n_i$, and the proportion of insects that die is recorded ($p_i$). This is done for $i = 1, 2, \ldots, M$
dosage levels.

$P[y_i = 1] = \pi_i = P[I_i^* \le I_i] = \Phi(\alpha + \beta x_i)$

where $I_i^*$ is the tolerance and $I_i = \alpha + \beta x_i$ is the stimulus. In economics, observations may be
grouped by income levels or age and we observe the labor participation rate for each income
or age group. For this type of grouped data, we estimate the probability of participating in
the labor force $\pi_i$ with $p_i$, the proportion from the sample. This requires a large number of
observations in each group, i.e., a large $n_i$ for $i = 1, 2, \ldots, M$. In this case, the approximation is

$z_i = \Phi^{-1}(p_i) \cong \alpha + \beta x_i$ (13.9)

For each $p_i$, we compute the standardized normal variates, the $z_i$'s, and we have an estimate of
$\alpha + \beta x_i$. Note that the standard normal distribution assumption is not restrictive in the sense
that if $I_i^*$ is $N(\mu, \sigma^2)$ rather than $N(0, 1)$, then one standardizes the $P[I_i^* \le I_i]$ by subtracting
$\mu$ and dividing by $\sigma$, in which case the new $I_i^*$ is $N(0, 1)$ and the new $\alpha$ is $(\alpha - \mu)/\sigma$, whereas
the new $\beta$ is $\beta/\sigma$. This also implies that $\mu$ and $\sigma$ are not separately estimable. A plot of the $z_i$'s
versus the $x_i$'s would give estimates of $\alpha$ and $\beta$. For the biometrics example, one can compute
LD50, which is the dosage level that will kill 50% of the insect population. This corresponds to
$z_i = 0$, which solves for $x_i = -\alpha/\beta$. Similarly, LD95 corresponds to $z_i = 1.645$, which solves for
$x_i = (1.645 - \alpha)/\beta$. Alternatively, for the economic example, LD50 is the minimum reservation
wage that is necessary for a 50% labor participation rate.
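
The steps just described (transform the proportions, fit a line, solve for LD50 and LD95) can be sketched in Python. The dosage levels and the values $\alpha = -2$, $\beta = 0.8$ are hypothetical, and the proportions are generated without sampling error, so the fitted line recovers these values exactly; with real data the $z_i$'s would only scatter around the line.

```python
from statistics import NormalDist

std_normal = NormalDist()

def ols_line(x, z):
    """Least squares fit of z on a constant and x; returns (intercept, slope)."""
    n = len(x)
    xbar, zbar = sum(x) / n, sum(z) / n
    slope = (sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
             / sum((xi - xbar) ** 2 for xi in x))
    return zbar - slope * xbar, slope

# Hypothetical dosage levels and kill proportions (not data from the text)
dosage = [1.0, 2.0, 3.0, 4.0, 5.0]
alpha_true, beta_true = -2.0, 0.8
p = [std_normal.cdf(alpha_true + beta_true * x) for x in dosage]

# (13.9): z_i = Phi^{-1}(p_i) is approximately alpha + beta * x_i
z = [std_normal.inv_cdf(pi) for pi in p]
alpha_hat, beta_hat = ols_line(dosage, z)

ld50 = -alpha_hat / beta_hat                                   # z = 0
ld95 = (std_normal.inv_cdf(0.95) - alpha_hat) / beta_hat       # z = 1.645
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")
print(f"LD50 = {ld50:.3f}, LD95 = {ld95:.3f}")
```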

One could improve on this method by including more x's on the right hand side of (13.9).
In this case, one can no longer plot the $z_i$ values versus the x variables. However, one can run
OLS of z on these x's. One problem remains: OLS ignores the heteroskedasticity in the error
term. To see this:

$p_i = \pi_i + \epsilon_i = F(x_i'\beta) + \epsilon_i$ (13.10)

where F is a general c.d.f. and $\pi_i = F(x_i'\beta)$. Using the properties of the binomial distribution,
$E(p_i) = \pi_i$ and $\mathrm{var}(p_i) = \pi_i(1 - \pi_i)/n_i$. Defining $z_i = F^{-1}(p_i)$, we obtain from (13.10)

$z_i = F^{-1}(p_i) = F^{-1}(\pi_i + \epsilon_i) \cong F^{-1}(\pi_i) + [dF^{-1}(\pi_i)/d\pi_i]\,\epsilon_i$ (13.11)

where the approximation $\cong$ is a Taylor series expansion around $\pi_i$ with $\epsilon_i \to 0$. Since F is
monotonic, $\pi_i = F(F^{-1}(\pi_i))$. Let $w_i = F^{-1}(\pi_i) = x_i'\beta$; differentiating with respect to $\pi_i$ gives

$1 = [dF(w_i)/dw_i]\,dw_i/d\pi_i$ (13.12)

Alternatively, this can be rewritten as

$dF^{-1}(\pi_i)/d\pi_i = dw_i/d\pi_i = 1/\{dF(w_i)/dw_i\} = 1/f(w_i) = 1/f(x_i'\beta)$ (13.13)

where $f$ is the probability density function corresponding to F. Using (13.13), equation (13.11)
can be rewritten as

$z_i = F^{-1}(p_i) \cong F^{-1}(\pi_i) + \epsilon_i/f(x_i'\beta)$ (13.14)
$\quad = F^{-1}(F(x_i'\beta)) + \epsilon_i/f(x_i'\beta) = x_i'\beta + \epsilon_i/f(x_i'\beta)$

From (13.14), it is clear that the regression disturbances of $z_i$ on $x_i$ are given by $u_i \cong \epsilon_i/f(x_i'\beta)$,
with $E(u_i) = 0$ and $\sigma_i^2 = \mathrm{var}(u_i) = \mathrm{var}(\epsilon_i)/f^2(x_i'\beta) = \pi_i(1 - \pi_i)/(n_i f_i^2) = F_i(1 - F_i)/(n_i f_i^2)$
since $\pi_i = F_i$, where the subscript $i$ on $f$ or $F$ denotes that the argument of that function is $x_i'\beta$.
This heteroskedasticity in the disturbances renders OLS on (13.14) consistent but inefficient. For
the probit, $\sigma_i^2 = \Phi_i(1 - \Phi_i)/(n_i\phi_i^2)$, and for the logit, $\sigma_i^2 = 1/[n_i\Lambda_i(1 - \Lambda_i)]$, since $f_i = \Lambda_i(1 - \Lambda_i)$.
Using $1/\sigma_i$ as weights, a WLS procedure can be performed on (13.14). Note that $F^{-1}(p)$ for
the logit is simply $\log[p/(1 - p)]$. This is one more reason why the logistic functional form is
so popular. In this case one regresses $\log[p/(1 - p)]$ on x correcting for heteroskedasticity using
WLS. This procedure is also known as the minimum logit chi-square method and is due to
Berkson (1953).

In order to obtain feasible estimates of the $\sigma_i$'s, one could use the OLS estimates of $\beta$ from
(13.14) to estimate the weights. Greene (1993) argues that one should not use the proportions
$p_i$'s as estimates for the $\pi_i$'s because this is equivalent to using the $y_i^2$'s instead of $\sigma_i^2$ in the
heteroskedastic regression. These will lead to inefficient estimates. If OLS on (13.14) is reported, one
should use the robust White heteroskedastic variance-covariance option, otherwise the standard
errors are biased and inference is misleading.
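
A minimal Python sketch of this feasible two-step procedure for the logit case follows. The grouped data are hypothetical, there is a single regressor plus a constant, and the helper `weighted_line` is introduced here only for illustration. Multiplying each observation by $1/\sigma_i$ is the same as minimizing a sum of squares weighted by $1/\sigma_i^2 = n_i\Lambda_i(1 - \Lambda_i)$, which is what the code does, with $\Lambda_i$ evaluated at the first-step OLS estimates rather than at the $p_i$'s.

```python
import math

def weighted_line(x, z, w):
    """Minimise sum_i w_i * (z_i - a - b*x_i)^2; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    zbar = sum(wi * zi for wi, zi in zip(w, z)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxz = sum(wi * (xi - xbar) * (zi - zbar) for wi, xi, zi in zip(w, x, z))
    b = sxz / sxx
    return zbar - b * xbar, b

# Hypothetical grouped data: regressor, group size, observed proportion
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n = [40, 55, 60, 50, 45, 30]
p = [0.10, 0.20, 0.35, 0.55, 0.70, 0.85]

# Log-odds: F^{-1}(p) for the logit
z = [math.log(pi / (1 - pi)) for pi in p]

# Step 1: OLS (equal weights) to get preliminary estimates
a_ols, b_ols = weighted_line(x, z, [1.0] * len(x))

# Step 2: weights 1/sigma_i^2 = n_i * Lambda_i * (1 - Lambda_i) at the OLS fit
lam = [1 / (1 + math.exp(-(a_ols + b_ols * xi))) for xi in x]
w = [ni * li * (1 - li) for ni, li in zip(n, lam)]
a_wls, b_wls = weighted_line(x, z, w)

print(f"OLS: a = {a_ols:.4f}, b = {b_ols:.4f}")
print(f"WLS: a = {a_wls:.4f}, b = {b_wls:.4f}")
```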

Example 1: Beer Taxes and Motor Vehicle Fatality. Ruhm (1996) used grouped logit analysis
with fixed time and state effects to study the impact of beer taxes and a variety of alcohol-control
policies on motor vehicle fatality rates. Ruhm collected panel data for 48 states (excluding
Alaska, Hawaii and the District of Columbia) over the period 1982-1988. The dependent variable
is a proportion p, denoting the total vehicle fatality rate per capita for state i at time t. One
can perform the inverse logit transformation, log[p/(1 - p)], provided p is not zero or one, and
run the usual fixed effects regression described in Chapter 12. Denote this dependent variable
by (LVFR). The explanatory variables included the real beer tax rate on 24 (12 oz.) containers
of beer (BEERTAX), the minimum legal drinking age (MLDA) in years, the percentage of the
population living in dry counties (DRY), the average number of vehicle miles per person aged
16 and over (VMILES), and the percentage of young drivers (15-24 years old) (YNGDRV). Also
included are some dummy variables indicating the presence of alcohol regulations. These include
BREATH, a dummy variable that takes the value 1 if the state authorized the police
to administer a pre-arrest breath test to establish probable cause for driving under the influence
(DUI), and JAILD, which takes the value 1 if the state passed legislation mandating jail, or
community service (COMSERD), for the first DUI conviction. Other variables included are the
unemployment rate, real per capita income, and state and time dummy variables. Details on
these variables are given in Table 1 of Ruhm (1996). Some of the variables in this data set
can be downloaded from the Stock and Watson (2003) web site at www.aw.com/stock_watson.
Table 13.1 replicates, to the extent possible, the grouped logit regression results in column (d)
of Table 2, p. 444 of Ruhm (1996). This regression does not include some of the other alcohol
regulations that were not provided in the data set. These results were replicated using the robust
White cross-section option in EViews.


Table 13.1 Grouped Logit, Beer Tax and Motor Vehicle Fatality

Dependent Variable: LVFR
Method: Panel Least Squares
Sample: 1982 1988
Cross-sections included: 48
Total panel (unbalanced) observations:
White cross-section standard errors & covariance (d.f. corrected)

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C             -9.361589     0.132383    -70.71570     0.0000
BEERTAX       -0.183533     0.077658     -2.363344    0.0188
MLDA          -0.004465     0.007814     -0.571427    0.5682
DRY            0.008677     0.002611      3.323523    0.0010
YNGDRV         0.493472     0.450802      1.094651    0.2746
VMILES         6.91E-06     4.23E-06      1.632849    0.1037
BREATH        -0.015930     0.027952     -0.569893    0.5692
JAILD         -0.012623     0.038973     -0.323888    0.7463
COMSERD        0.020238     0.024505      0.825867    0.4096
PERINC         0.060305     0.010275      5.868945    0.0000

Effects Specification
Cross-section fixed (dummy variables)
Period fixed (dummy variables)

R-squared             0.931852    Mean dependent var      -8.534768
Adjusted R-squared    0.916318    S.D. dependent var       0.276471
S.E. of regression    0.079977    Akaike info criterion   -2.046359
Sum squared resid     1.739808    Schwarz criterion       -1.329075
Log likelihood        405.7652    F-statistic              59.98854
Durbin-Watson stat    1.433777    Prob(F-statistic)        0.000000

Table 13.1 shows that the beer tax coefficient is negative and significant, while the minimum legal
drinking age is not significant. Neither are the breath test law, JAILD or COMSERD variables, all of
which represent state alcohol safety related legislation. Income per capita and the percentage of the
population living in dry counties have a positive and significant effect on motor vehicle fatality
rates. The state dummy variables are jointly significant with an observed F-value of 34.9 which is
distributed as F(47, 272). The year dummies are jointly significant with an observed F-value of
2.97 which is distributed as F(6, 272). Problem 12 asks the reader to replicate Table 13.1. These
results imply that increasing the minimum legal drinking age, or imposing stiffer punishments
like mandating jail or community service, are not effective policy tools for decreasing traffic
related deaths. However, increasing the real tax on beer is an effective policy for reducing traffic
related deaths.

For grouped data, the sample sizes $n_i$ for each group have to be sufficiently large. Also, the
$p_i$'s cannot be zero or one. One modification suggested in the literature is to add $(1/2n_i)$ to $p_i$
when computing the log of odds ratio, see Cox (1970).


Example 2: Fractional Response. Papke and Wooldridge (1996) argue that in many economic
settings $p_i$ may be 0 or 1 for a large number of observations, for example, when studying
participation rates in pension plans or when studying high school graduation rates. They propose a
fractional logit regression which handles fractional response variables based on quasi-likelihood
methods. Fractional response variables are bounded variables. Without loss of generality, they
could be restricted to lie between 0 and 1. Examples include the proportion of income spent on
charitable contributions and the fraction of total weekly hours spent working. Papke and Wooldridge
(1996) propose modeling $E(y_i/x_i)$ as a logistic function $\Lambda(x_i'\beta)$. This ensures that the
predicted value of $y_i$ lies in the interval (0, 1). It is also well defined even if $y_i$ takes the values 0 or
1 with positive probability. It is important to note that in case $y_i$ is a proportion from a group
of known size $n_i$, the quasi-maximum likelihood method ignores the information on $n_i$. Using
the Bernoulli log-likelihood function, one gets

$L_i(\beta) = y_i \log[\Lambda(x_i'\beta)] + (1 - y_i) \log[1 - \Lambda(x_i'\beta)]$

for $i = 1, 2, \ldots, n$, with $0 < \Lambda(x_i'\beta) < 1$.

Maximizing $\sum_{i=1}^{n} L_i(\beta)$ with respect to $\beta$ yields the quasi-MLE which is consistent and $\sqrt{n}$
asymptotically normal regardless of the distribution of $y_i$ conditional on $x_i$, see Gourieroux,
Monfort and Trognon (1984) and McCullagh and Nelder (1989). The latter proposed the
generalized linear models (GLM) approach to this problem in statistics. Logit QMLE can be done
in Stata using the GLM command with the Binary family function indicating Bernoulli and the
Link function indicating the logistic distribution.
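
To make the quasi-likelihood idea concrete, here is a small Python sketch that maximizes the Bernoulli quasi-log-likelihood by Newton-Raphson for a constant and one regressor. The data are hypothetical (they are not the 401(K) data), the function name `fractional_logit` is an illustrative choice, and no robust standard errors are computed. Note that observations with $y_i = 1$ cause no difficulty, since no log-odds transformation of $y_i$ is needed.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def fractional_logit(x, y, iters=25):
    """Newton-Raphson on sum_i L_i(beta) with mean Lambda(a + b*x_i).
    y_i may be any value in [0, 1], including 0 and 1."""
    a = b = 0.0
    for _ in range(iters):
        lam = [logistic(a + b * xi) for xi in x]
        # Score: sum (y_i - Lambda_i) * (1, x_i)
        s0 = sum(yi - li for yi, li in zip(y, lam))
        s1 = sum((yi - li) * xi for yi, li, xi in zip(y, lam, x))
        # Minus the Hessian: sum Lambda_i (1 - Lambda_i) * (1, x_i)(1, x_i)'
        w = [li * (1 - li) for li in lam]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        a += (h11 * s0 - h01 * s1) / det
        b += (h00 * s1 - h01 * s0) / det
    return a, b

# Hypothetical plan-level data: match rate and participation proportion
mrate = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 0.3, 0.7]
prate = [0.55, 0.70, 0.80, 1.0, 1.0, 0.95, 0.60, 0.90]

a_hat, b_hat = fractional_logit(mrate, prate)
fitted = [logistic(a_hat + b_hat * xi) for xi in mrate]
print(f"constant = {a_hat:.4f}, slope = {b_hat:.4f}")
```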

Papke and Wooldridge (1996) derive the robust asymptotic variance of the QMLE of $\beta$ and
suggest some specification tests based on Wooldridge (1991). They apply their methods to the
participation in 401(K) pension plans. The data are from the 1987 IRS Form 5500 reports of
pension plans with more than 100 participants. This data set containing 4734 observations
can be downloaded from the Journal of Applied Econometrics Data Archive. We focus on a
subset of their data which includes 3784 observations of plans with match rates less than or
equal to one. Match rates above one may be indicating end-of-plan year employer
contributions made to avoid IRS disqualification. Participation rates (PRATE) in this sample are high,
averaging 84.8%. Over 40% of the plans have a participation proportion of one. This makes
the log-odds ratio approach awkward since adjustments have to be made to more than 40%
of the observations. The plan match rate (MRATE) averages about 41 cents on the dollar.
Other explanatory variables include total firm employment (EMP), age of the plan (AGE), and
a dummy variable (SOLE) which takes the value of 1 if the 401(K) plan is the only pension
plan offered by the employer. The 401(K) plans average 12 years in age, and they are the SOLE
plan in 37% of the sample. The average employment is 4622. Problem 14 asks the reader
to replicate the descriptive statistics given in Table I of Papke and Wooldridge (1996, p. 627).
Table 13.2 gives the Stata output for logit QMLE using the same specification given in Table II
of Papke and Wooldridge (1996, p. 628). Note that it uses the GLM command, the Bernoulli
variance function and the logit link function. The results show that there is a positive and
significant relationship between match rate and participation rate. All the other variables
included are significant except for SOLE. Problem 14 asks the reader to replicate this result
and compare with OLS. The latter turns out to have a lower $R^2$ and fails a RESET test, see
Chapter 8.

Table 13.2 Logit Quasi-MLE of Participation Rates in 401(K) Plan

glm prate mrate log_emp log_emp2 age age2 sole if one==1, f(bin) l(logit) robust
note: prate has non-integer values

Iteration 0:   log pseudo-likelihood = -1200.8698
Iteration 1:   log pseudo-likelihood = -1179.3843
Iteration 2:   log pseudo-likelihood = -1179.2785
Iteration 3:   log pseudo-likelihood = -1179.2785

Generalized linear models                    Number of obs     =        3784
Optimization      : ML: Newton-Raphson       Residual df       =        3777
                                             Scale parameter   =           1
Deviance          = 1273.60684               (1/df) Deviance   =    .3372006
Pearson           = 724.4199889              (1/df) Pearson    =    .1917977

Variance function : V(u) = u*(1-u)           [Bernoulli]
Link function     : g(u) = ln(u/(1-u))       [Logit]
Standard errors   : Sandwich

Log pseudo-likelihood = -1179.278516         AIC               =    .6269971
BIC                   = -29843.34715

                         Robust
prate         Coef.      Std. Err.      z      P>|z|     [95% Conf. Interval]
mrate       1.39008      .1077064     12.91    0.000     1.17898     1.601181
log_emp    -1.001874     .1104365     -9.07    0.000    -1.218326    -.7854229
log_emp2    .0521864     .0071278      7.32    0.000     .0382161     .0661568
age         .0501126     .0088451      5.67    0.000     .0327766     .0674486
age2       -.0005154     .0002117     -2.43    0.015    -.0009303    -.0001004
sole        .0079469     .0502025      0.16    0.874    -.0904482     .1063421
_cons       5.057997     .4208646     12.02    0.000     4.233117     5.882876

13.5 Individual Data: Probit and Logit

When the number of observations $n_i$ in each group is small, one cannot obtain reliable
estimates of the $\pi_i$'s with the $p_i$'s. In this case, one should not group the observations; instead
these observations should be treated as individual observations and the model estimated by the
maximum likelihood procedure. The likelihood is obtained as independent random draws from
a Bernoulli distribution with probability of success $\pi_i = F(x_i'\beta) = P[y_i = 1]$. Hence

$\ell = \prod_{i=1}^{n} [F(x_i'\beta)]^{y_i} [1 - F(x_i'\beta)]^{1 - y_i}$ (13.15)

and the log-likelihood

$\log\ell = \sum_{i=1}^{n} \{y_i \log F(x_i'\beta) + (1 - y_i) \log[1 - F(x_i'\beta)]\}$ (13.16)

The first-order conditions for maximization require the score $S(\beta) = \partial\log\ell/\partial\beta$ to be zero:

$S(\beta) = \partial\log\ell/\partial\beta = \sum_{i=1}^{n} \{[f_i y_i/F_i] - (1 - y_i)[f_i/(1 - F_i)]\} x_i$ (13.17)
$\quad = \sum_{i=1}^{n} (y_i - F_i) f_i x_i/[F_i(1 - F_i)] = 0$

where the subscript $i$ on $f$ or $F$ denotes that the argument of that function is $x_i'\beta$. For the
logit model (13.17) reduces to

$S(\beta) = \sum_{i=1}^{n} (y_i - \Lambda_i) x_i = 0$ since $f_i = \Lambda_i(1 - \Lambda_i)$ (13.18)

If there is a constant in the model, the solution to (13.18) for $x_i = 1$ implies that $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat\Lambda_i$.
This means that the number of participants in the sample, i.e., those with $y_i = 1$, will
always be equal to the predicted number of participants from the logit model. Similarly, if $x_i$
is a dummy variable which is 1 if the individual is male and zero if the individual is female,
then (13.18) states that the predicted frequency is equal to the actual frequency for males and
females. Note that (13.18) resembles the OLS normal equations if we interpret $(y_i - \hat\Lambda_i)$ as
residuals. For the probit model (13.17) reduces to

$S(\beta) = \sum_{i=1}^{n} (y_i - \Phi_i)\phi_i x_i/[\Phi_i(1 - \Phi_i)]$ (13.19)
$\quad = \sum_{y_i = 0} \lambda_{0i} x_i + \sum_{y_i = 1} \lambda_{1i} x_i = 0$

where $\lambda_{0i} = -\phi_i/[1 - \Phi_i]$ for $y_i = 0$ and $\lambda_{1i} = \phi_i/\Phi_i$ for $y_i = 1$. Also, $\sum_{y_i = 0}$ denotes the sum over
all zero values of $y_i$. These $\lambda_i$'s are thought of as generalized residuals which are orthogonal to
$x_i$. Note that unlike the logit, the probit does not necessarily predict the number of participants
to be exactly equal to the number of ones in the sample.

Equations (13.17) are highly nonlinear and may be solved using the scoring method, i.e.,
starting with some initial value $\beta_0$ we revise this estimate as follows:

$\beta_1 = \beta_0 + [I^{-1}(\beta_0)] S(\beta_0)$ (13.20)

where $S(\beta) = \partial\log\ell/\partial\beta$ and $I(\beta) = E[-\partial^2\log\ell/\partial\beta\partial\beta']$. This process is repeated until
convergence. For the logit and probit models, $\log F(x_i'\beta)$ and $\log[1 - F(x_i'\beta)]$ are concave. Hence, the
log-likelihood function given by (13.16) is globally concave, see Pratt (1981). Hence, for both
the logit and probit, $[\partial^2\log\ell/\partial\beta\partial\beta']$ is negative definite for all values of $\beta$ and the iterative
procedure will converge to the unique maximum likelihood estimate $\hat\beta_{MLE}$ no matter what
starting values we use. In this case, the asymptotic covariance matrix of $\hat\beta_{MLE}$ is estimated by
$I^{-1}(\hat\beta_{MLE})$ from the last iteration.

Amemiya (1981, p. 1495) derived $I(\beta)$ by differentiating (13.17), multiplying by a negative
sign and taking the expected value. The result is given by:

$I(\beta) = -E[\partial^2\log\ell/\partial\beta\partial\beta'] = \sum_{i=1}^{n} f_i^2 x_i x_i'/[F_i(1 - F_i)]$ (13.21)

For the logit, (13.21) reduces to

$I(\beta) = \sum_{i=1}^{n} \Lambda_i(1 - \Lambda_i) x_i x_i'$ (13.22)

For the probit, (13.21) reduces to

$I(\beta) = \sum_{i=1}^{n} \phi_i^2 x_i x_i'/[\Phi_i(1 - \Phi_i)]$ (13.23)

Alternative maximization may use the Newton-Raphson iterative procedure which uses the
Hessian itself rather than its expected value in (13.20), i.e., $I(\beta)$ is replaced by $H(\beta) =
[-\partial^2\log\ell/\partial\beta\partial\beta']$. For the logit model, $H(\beta) = I(\beta)$ and is given in (13.22). For the probit
model, $H(\beta) = \sum_{i=1}^{n} [\lambda_i^2 + \lambda_i x_i'\beta] x_i x_i'$ which is different from (13.23). Note that $\lambda_i = \lambda_{0i}$ if
$y_i = 0$; and $\lambda_i = \lambda_{1i}$ if $y_i = 1$. These were defined below (13.19).

A third method, suggested by Berndt, Hall, Hall and Hausman (1974), uses the outer product
of the first derivatives in place of $I(\beta)$, i.e., $G(\beta) = S(\beta)S'(\beta)$. For the logit model, this is
$G(\beta) = \sum_{i=1}^{n} (y_i - \Lambda_i)^2 x_i x_i'$. For the probit model, $G(\beta) = \sum_{i=1}^{n} \lambda_i^2 x_i x_i'$. As in the method of
scoring, one iterates starting from initial estimates $\beta_0$, and the asymptotic variance-covariance
matrix is estimated from the inverse of $G(\beta)$, $H(\beta)$ or $I(\beta)$ in the last iteration.
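
The method of scoring in (13.20), with the score from (13.17) and the information matrix from (13.21), can be sketched in Python for a model with a constant and one regressor. The binary data below are hypothetical and the function `scoring` is an illustrative name; the same routine handles the logit and the probit by passing the appropriate pair $(F, f)$.

```python
import math
from statistics import NormalDist

_std = NormalDist()
LOGIT = (lambda v: 1.0 / (1.0 + math.exp(-v)),
         lambda v: math.exp(-v) / (1.0 + math.exp(-v)) ** 2)
PROBIT = (_std.cdf, _std.pdf)

def scoring(x, y, model, iters=100):
    """Iterate beta_1 = beta_0 + I^{-1}(beta_0) S(beta_0) for P[y=1] = F(a + b*x).
    Returns (a, b) and I^{-1} as computed in the last iteration."""
    F, f = model
    a = b = 0.0
    for _ in range(iters):
        Fi = [F(a + b * xi) for xi in x]
        fi = [f(a + b * xi) for xi in x]
        # Score (13.17): sum (y_i - F_i) f_i / [F_i (1 - F_i)] * (1, x_i)
        g = [(yi - F_) * f_ / (F_ * (1 - F_)) for yi, F_, f_ in zip(y, Fi, fi)]
        s0 = sum(g)
        s1 = sum(gi * xi for gi, xi in zip(g, x))
        # Information (13.21): sum f_i^2 / [F_i (1 - F_i)] * (1, x_i)(1, x_i)'
        w = [f_ ** 2 / (F_ * (1 - F_)) for F_, f_ in zip(Fi, fi)]
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        inv = [[i11 / det, -i01 / det], [-i01 / det, i00 / det]]
        a += inv[0][0] * s0 + inv[0][1] * s1
        b += inv[1][0] * s0 + inv[1][1] * s1
    return (a, b), inv

# Hypothetical individual data (not from the text)
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

(a_l, b_l), inv_l = scoring(x, y, LOGIT)
(a_p, b_p), inv_p = scoring(x, y, PROBIT)

pred_logit = sum(LOGIT[0](a_l + b_l * xi) for xi in x)
pred_probit = sum(PROBIT[0](a_p + b_p * xi) for xi in x)
print(f"logit : b = {b_l:.4f} (s.e. {math.sqrt(inv_l[1][1]):.4f}), "
      f"sum of fitted = {pred_logit:.4f}")
print(f"probit: b = {b_p:.4f} (s.e. {math.sqrt(inv_p[1][1]):.4f}), "
      f"sum of fitted = {pred_probit:.4f}")
```

For the logit, the sum of fitted probabilities reproduces the number of ones in the sample, as implied by (13.18) when a constant is included; for the probit this need only hold approximately.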

Tests of hypotheses can be carried out from the asymptotic standard errors using t-statistics.
For $R\beta = r$ type restrictions, the usual Wald test $W = (R\hat\beta - r)'[RV(\hat\beta)R']^{-1}(R\hat\beta - r)$ can
be used with $V(\hat\beta)$ obtained from the last iteration as described above. Likelihood ratio and
Lagrange Multiplier statistics can also be computed. $LR = -2[\log\ell_{restricted} - \log\ell_{unrestricted}]$,
whereas the Lagrange Multiplier statistic is $LM = S'(\tilde\beta)V(\tilde\beta)S(\tilde\beta)$, where $S(\tilde\beta)$ is the score
evaluated at the restricted estimator. Davidson and MacKinnon (1984) suggest that $V(\hat\beta)$ based
on $I(\hat\beta)$ is the best of the three estimators to use. In fact, Monte Carlo experiments show that
the estimate of $V(\hat\beta)$ based on the outer product of the first derivatives usually performs the
worst and is not recommended in practice. All three statistics are asymptotically equivalent
and are asymptotically distributed as $\chi^2_q$ where $q$ is the number of restrictions. The next section
discusses tests of hypotheses using an artificial regression.


13.6 The Binary Response Model Regression¹

Davidson and MacKinnon (1984) suggest a modified version of the Gauss-Newton regression
(GNR) considered in Chapter 8 which is useful in the context of a binary response model
described in (13.5).² In fact, we have shown that this model can be written as a nonlinear
regression

$y_i = F(x_i'\beta) + u_i$ (13.24)

with $u_i$ having zero mean and $\mathrm{var}(u_i) = F_i(1 - F_i)$. The GNR ignoring heteroskedasticity yields

$(y_i - F_i) = f_i x_i'b$ + residual

where $b$ is the regression estimates when we regress $(y_i - F_i)$ on $f_i x_i'$.

Correcting for heteroskedasticity by dividing each observation by its standard deviation we
get the Binary Response Model Regression (BRMR):

$\dfrac{(y_i - F_i)}{\sqrt{F_i(1 - F_i)}} = \dfrac{f_i}{\sqrt{F_i(1 - F_i)}}\, x_i'b$ + residual (13.25)

For the logit model with $f_i = \Lambda_i(1 - \Lambda_i)$, this simplifies further to

$\dfrac{y_i - \Lambda_i}{\sqrt{f_i}} = \sqrt{f_i}\, x_i'b$ + residual (13.26)

For the probit model, the BRMR is given by

$\dfrac{y_i - \Phi_i}{\sqrt{\Phi_i(1 - \Phi_i)}} = \dfrac{\phi_i}{\sqrt{\Phi_i(1 - \Phi_i)}}\, x_i'b$ + residual (13.27)

Like the GNR considered in Chapter 8, the BRMR given in (13.25) can be used for obtaining
parameter and covariance matrix estimates as well as tests of hypotheses. In fact, Davidson and
MacKinnon point out that the transpose of the dependent variable in (13.25) times the matrix
of regressors in (13.25) yields a vector whose typical element is exactly that of $S(\beta)$ given in
(13.17). Also, the transpose of the matrix of regressors in (13.25) multiplied by itself yields a
matrix whose typical element is exactly that of $I(\beta)$ given in (13.21).

Let us consider how the BRMR is used to test hypotheses. Suppose that $\beta' = (\beta_1', \beta_2')$ where
$\beta_1$ is of dimension $k - r$ and $\beta_2$ is of dimension $r$. We want to test $H_0$: $\beta_2 = 0$. Let $\tilde\beta' = (\tilde\beta_1', 0)$
be the restricted MLE of $\beta$ subject to $H_0$. In order to test $H_0$, we run the BRMR:

$\dfrac{y_i - \tilde F_i}{\sqrt{\tilde F_i(1 - \tilde F_i)}} = \dfrac{\tilde f_i}{\sqrt{\tilde F_i(1 - \tilde F_i)}}\, x_{i1}'b_1 + \dfrac{\tilde f_i}{\sqrt{\tilde F_i(1 - \tilde F_i)}}\, x_{i2}'b_2$ + residual (13.28)

where $x_i' = (x_{i1}', x_{i2}')$ has been partitioned into vectors conformable with the corresponding
partition of $\beta$. Also, $\tilde F_i = F(x_i'\tilde\beta)$ and $\tilde f_i = f(x_i'\tilde\beta)$. The suggested test statistic for $H_0$ is the
explained sum of squares of the regression (13.28). This is asymptotically distributed as $\chi^2_r$
under $H_0$. A special case of this BRMR is that of testing the null hypothesis that all the slope
coefficients are zero. In this case, $x_{i1} = 1$ and $\beta_1$ is the constant $\alpha$. Problem 2 shows that the
restricted MLE in this case is $F(\tilde\alpha) = \bar y$ or $\tilde\alpha = F^{-1}(\bar y)$, where $\bar y$ is the proportion of the sample
with $y_i = 1$. Therefore, the BRMR in (13.25) reduces to

$\dfrac{y_i - \bar y}{\sqrt{\bar y(1 - \bar y)}} = \dfrac{f_i(\tilde\alpha)}{\sqrt{\bar y(1 - \bar y)}}\, b_1 + \dfrac{f_i(\tilde\alpha)}{\sqrt{\bar y(1 - \bar y)}}\, x_{i2}'b_2$ + residual (13.29)

Note that $\bar y(1 - \bar y)$ is constant for all observations. The test for $b_2 = 0$ is not affected by dividing
the dependent variable or the regressors by this constant, nor is it affected by subtracting a
constant from the dependent variable. Hence, the test for $b_2 = 0$ can be carried out by regressing
$y_i$ on a constant and $x_{i2}$ and testing that the slope coefficients of $x_{i2}$ are zero using the usual
least squares F-statistic. This is a simpler alternative to the likelihood ratio test proposed in
the previous section and described in the empirical example in section 13.9. For other uses of
the BRMR, see Davidson and MacKinnon (1993).
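
The special case in (13.29) is easy to compute. The Python sketch below uses hypothetical binary data with a single regressor $x_{i2}$. It computes the explained sum of squares of (13.29) and the least squares F-statistic from regressing $y_i$ on a constant and $x_{i2}$. As a by-product, and this is an observation added here rather than a statement from the text, the explained sum of squares equals $nR^2$ from that OLS regression, because $\sum_i (y_i - \bar y)^2 = n\bar y(1 - \bar y)$ when $y_i$ is binary.

```python
import math

# Hypothetical binary outcomes and one regressor (not from the text)
x2 = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
n = len(y)
ybar = sum(y) / n
xbar = sum(x2) / n

sxx = sum((xi - xbar) ** 2 for xi in x2)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x2, y))
syy = sum((yi - ybar) ** 2 for yi in y)

# OLS of y on a constant and x2: explained and residual sums of squares
ess = sxy ** 2 / sxx
rss = syy - ess
r2 = ess / syy
f_stat = ess / (rss / (n - 2))  # F(1, n-2) statistic for a zero slope

# BRMR (13.29): the regressors are a constant and x2, both multiplied by the
# same constant f(alpha)/sqrt(ybar(1-ybar)), so they span (1, x2).
scale = math.sqrt(ybar * (1 - ybar))
d = [(yi - ybar) / scale for yi in y]          # dependent variable, mean zero
sxd = sum((xi - xbar) * di for xi, di in zip(x2, d))
brmr_ess = sxd ** 2 / sxx                      # explained sum of squares

print(f"BRMR explained SS = {brmr_ess:.4f}, n * R^2 = {n * r2:.4f}")
print(f"least squares F-statistic = {f_stat:.4f}")
```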
13.7 Asymptotic Variances for Predictions and Marginal Effects

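Two results of interest after estimating the model are: the predictions $F(x'\hat\beta)$ and the marginal
effects $\partial F/\partial x = f(x'\hat\beta)\hat\beta$. For example, given the characteristics of an individual $x$, we can
predict his or her probability of purchasing a car. Also, given a change in $x$, say income,
one can estimate the marginal effect this will have on the probability of purchasing a car.
The latter effect is constant for the linear probability model and is given by the regression
coefficient of income, whereas for the probit and logit models this marginal effect will vary
with the $x_i$'s, see (13.7) and (13.8). These marginal effects can be computed with Stata using
the dprobit command. The default is to compute them at the sample mean $\bar x$. There is also
the additional problem of computing variances for these predictions and marginal effects. Both
$F(x'\hat\beta)$ and $f(x'\hat\beta)\hat\beta$ are nonlinear functions of the $\hat\beta$'s. To compute standard errors, we can use
the following linear approximation which states that whenever $\hat\theta = F(\hat\beta)$ then the asy.var($\hat\theta$) =
$(\partial F/\partial\hat\beta)'V(\hat\beta)(\partial F/\partial\hat\beta)$. For the predictions, let $z = x'\hat\beta$ and denote by $\hat F = F(x'\hat\beta)$ and $\hat f =
f(x'\hat\beta)$, then

$\partial\hat F/\partial\hat\beta = (\partial\hat F/\partial z)(\partial z/\partial\hat\beta) = \hat f x$ and asy.var($\hat F$) = $\hat f^2 x'V(\hat\beta)x$.

For the marginal effects, let $\hat\gamma = \hat f\hat\beta$, then

asy.var($\hat\gamma$) = $(\partial\hat\gamma/\partial\hat\beta')V(\hat\beta)(\partial\hat\gamma/\partial\hat\beta')'$ (13.30)

where $\partial\hat\gamma/\partial\hat\beta' = \hat f I_k + \hat\beta(\partial\hat f/\partial z)(\partial z/\partial\hat\beta') = \hat f I_k + (\partial\hat f/\partial z)(\hat\beta x')$.

For the probit model, $\partial\hat f/\partial z = \partial\hat\phi/\partial z = -z\hat\phi$. So, $\partial\hat\gamma/\partial\hat\beta' = \hat\phi[I_k - z\hat\beta x']$ and

asy.var($\hat\gamma$) = $\hat\phi^2[I_k - x'\hat\beta\hat\beta x']V(\hat\beta)[I_k - x'\hat\beta\hat\beta x']'$ (13.31)

For the logit model, $\hat f = \hat\Lambda(1 - \hat\Lambda)$, so

$\partial\hat f/\partial z = (1 - 2\hat\Lambda)(\partial\hat\Lambda/\partial z) = (1 - 2\hat\Lambda)\hat f = (1 - 2\hat\Lambda)\hat\Lambda(1 - \hat\Lambda)$
$\partial\hat\gamma/\partial\hat\beta' = \hat\Lambda(1 - \hat\Lambda)[I_k + (1 - 2\hat\Lambda)\hat\beta x']$

and (13.30) becomes

asy.var($\hat\gamma$) = $[\hat\Lambda(1 - \hat\Lambda)]^2[I_k + (1 - 2\hat\Lambda)\hat\beta x']V(\hat\beta)[I_k + (1 - 2\hat\Lambda)\hat\beta x']'$ (13.32)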
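
The delta-method formulas for the logit can be sketched in Python as follows. The estimates $\hat\beta$, their covariance matrix $V(\hat\beta)$ and the point $x$ are all hypothetical numbers chosen for illustration, and the helper names are introduced here. The code evaluates asy.var($\hat F$) = $\hat f^2 x'Vx$ and the covariance matrix in (13.32).

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def gamma(beta, x):
    """Marginal effects f(x'beta) * beta for the logit."""
    lam = logistic(sum(b * xi for b, xi in zip(beta, x)))
    return [lam * (1 - lam) * b for b in beta]

def jacobian(beta, x):
    """d gamma / d beta' = Lambda(1-Lambda) [I_k + (1 - 2 Lambda) beta x']."""
    lam = logistic(sum(b * xi for b, xi in zip(beta, x)))
    f = lam * (1 - lam)
    k = len(beta)
    return [[f * ((1.0 if i == j else 0.0) + (1 - 2 * lam) * beta[i] * x[j])
             for j in range(k)] for i in range(k)]

def sandwich(J, V):
    """Return J V J'."""
    k = len(V)
    JV = [[sum(J[i][m] * V[m][j] for m in range(k)) for j in range(k)]
          for i in range(k)]
    return [[sum(JV[i][m] * J[j][m] for m in range(k)) for j in range(k)]
            for i in range(k)]

# Hypothetical inputs: (constant, slope) estimates, their covariance, and x
beta_hat = [-1.0, 0.5]
V = [[0.04, -0.01], [-0.01, 0.01]]
x = [1.0, 4.0]

lam = logistic(sum(b * xi for b, xi in zip(beta_hat, x)))
f = lam * (1 - lam)
xVx = sum(x[i] * V[i][j] * x[j] for i in range(2) for j in range(2))
var_pred = f ** 2 * xVx                                  # asy.var of F-hat
cov_gamma = sandwich(jacobian(beta_hat, x), V)           # (13.32)

print(f"prediction = {lam:.4f}, s.e. = {math.sqrt(var_pred):.4f}")
print(f"marginal effect of x_2 = {gamma(beta_hat, x)[1]:.4f}, "
      f"s.e. = {math.sqrt(cov_gamma[1][1]):.4f}")
```

The standard errors are the square roots of the diagonal elements; the same structure applies to the probit with the Jacobian replaced by $\hat\phi[I_k - z\hat\beta x']$.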

13.8  Goodness  of  Fit  Measures 

There  are  problems  with  the  use  of  conventional  i?2-type  measures  when  the  explained  variable 
y  takes  on  two  values,  see  Maddala  (1983,  pp.  37-41).  The  predicted  values  y  are  probabilities 
and  the  actual  values  of  y  are  either  0  or  1  so  the  usual  R2  is  likely  to  be  very  low.  Also,  if  there 
is  a  constant  in  the  model  the  linear  probability  and  logit  models  satisfy  YH/=xVi  =  Yl?=iVi- 
However,  the  probit  model  does  not  necessarily  satisfy  this  exact  relationship  although  it  is 
approximately  valid. 

Several $R^2$-type measures have been suggested in the literature, some of these are the following:

(i) The squared correlation between $y$ and $\hat y$: $R_1^2 = r^2_{y,\hat y}$.

(ii) Measures based on the residual sum of squares: Efron (1978) suggested using

$$R_2^2 = 1 - \left[\sum_{i=1}^n (y_i - \hat y_i)^2 \Big/ \sum_{i=1}^n (y_i - \bar y)^2\right] = 1 - \left[n\sum_{i=1}^n (y_i - \hat y_i)^2 \Big/ n_1 n_2\right]$$

since $\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n y_i^2 - n\bar y^2 = n_1 - n(n_1/n)^2 = n_1 n_2/n$, where $n_1 = \sum_{i=1}^n y_i$ and $n_2 = n - n_1$.

Amemiya (1981, p. 1504) suggests using $\sum_{i=1}^n (y_i - \hat y_i)^2/[\hat y_i(1 - \hat y_i)]$ as the residual sum
of squares. This weights each squared error by the inverse of its variance.

(iii) Measures based on likelihood ratios: $R_3^2 = 1 - (\ell_r/\ell_u)^{2/n}$ where $\ell_r$ is the restricted likelihood and $\ell_u$ is the unrestricted likelihood. This tests that all the slope coefficients are zero
in the standard linear regression model. For the limited dependent variable model however,
the likelihood function has a maximum of 1. This means that $\ell_r \le \ell_u \le 1$ or $\ell_r \le (\ell_r/\ell_u) \le 1$
or $\ell_r^{2/n} \le 1 - R_3^2 \le 1$ or $0 \le R_3^2 \le 1 - \ell_r^{2/n}$. Hence, Cragg and Uhler (1970) suggest a
pseudo-$R^2$ that lies between 0 and 1, and is given by $R_4^2 = (\ell_u^{2/n} - \ell_r^{2/n})/[(1 - \ell_r^{2/n})\ell_u^{2/n}]$.
Another measure suggested by McFadden (1974) is $R_5^2 = 1 - (\log\ell_u/\log\ell_r)$.

(iv) Proportion of correct predictions: After computing $\hat y$, one classifies the i-th observation
as a success if $\hat y_i > 0.5$, and a failure if $\hat y_i < 0.5$. This measure is useful but may not have
enough discriminatory power.
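
As a rough illustration of how these measures are computed from a vector of outcomes and fitted probabilities, the following Python sketch evaluates the residual-based measure in (ii), the likelihood-based measures in (iii) and the proportion of correct predictions in (iv). The data and the function name `fit_measures` are hypothetical. The restricted log-likelihood is taken from a constant-only model that assigns the sample proportion $\bar y$ to every observation.

```python
import math

def fit_measures(y, p):
    """Goodness-of-fit measures for a binary outcome y and fitted probabilities p."""
    n = len(y)
    n1 = sum(y)
    n2 = n - n1
    ybar = n1 / n
    # (ii) residual-sum-of-squares measure: 1 - n * SSR / (n1 * n2)
    ssr = sum((yi - pi) ** 2 for yi, pi in zip(y, p))
    r2_ssr = 1.0 - n * ssr / (n1 * n2)
    # Unrestricted and restricted (constant-only) log-likelihoods
    loglik_u = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                   for yi, pi in zip(y, p))
    loglik_r = n * (ybar * math.log(ybar) + (1 - ybar) * math.log(1 - ybar))
    # (iii) 1 - (l_r / l_u)^(2/n), its upper bound, and the rescaled pseudo-R2
    r2_lr = 1.0 - math.exp((2.0 / n) * (loglik_r - loglik_u))
    r2_lr_max = 1.0 - math.exp((2.0 / n) * loglik_r)
    r2_cragg_uhler = r2_lr / r2_lr_max
    r2_mcfadden = 1.0 - loglik_u / loglik_r
    # (iv) proportion of correct predictions with a 0.5 cutoff
    correct = sum(1 for yi, pi in zip(y, p) if (pi > 0.5) == (yi == 1)) / n
    return {"ssr": r2_ssr, "lr": r2_lr, "lr_max": r2_lr_max,
            "cragg_uhler": r2_cragg_uhler, "mcfadden": r2_mcfadden,
            "correct": correct}

# Hypothetical outcomes and fitted probabilities, for illustration only
y_obs = [0, 0, 1, 1, 0, 1, 1, 0]
p_hat = [0.2, 0.3, 0.7, 0.8, 0.4, 0.6, 0.9, 0.1]
measures = fit_measures(y_obs, p_hat)
for name, value in measures.items():
    print(name, round(value, 4))
```

Dividing the likelihood ratio measure by its upper bound is algebraically the same as the Cragg and Uhler expression, which is why the sketch computes it that way.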
13.9  Empirical  Examples

Example  1:  Union  Participation

To  illustrate  the  logit  and  probit  models,  we  consider  the  PSID  data  for  1982  used  in  Chapter  4. 
In  this  example,  we  are  interested  in  modelling  union  participation.  Out  of  the  595  individuals 
observed  in  1982,  218  individuals  had  their  wage  set  by  a  union  and  377  did  not.  The  explanatory 
variables  used  are:  years  of  education  (ED),  weeks  worked  (WKS),  years  of  full-time  work 
experience  (EXP),  occupation  ( OCC  =  1,  if  the  individual  is  in  a  blue-collar  occupation), 
residence  ( SOUTH  =  1,  SMSA  =  1,  if  the  individual  resides  in  the  South,  or  in  a  standard 
metropolitan  statistical  area),  industry  ( IND  =  1,  if  the  individual  works  in  a  manufacturing 
industry),  marital  status  (MS  =  1,  if  the  individual  is  married),  sex  and  race  (FEM  =  1, 
BLK  =  1,  if  the  individual  is  female  or  black).  A  full  description  of  the  data  is  given  in  Cornwell 
and  Rupert  (1988).  The  results  of  the  linear  probability,  logit  and  probit  models  are  given  in 
Table  13.3.  These  were  computed  using  EViews.  In  fact  Table  13.4  gives  the  probit  output. 
We have already mentioned that the probit model normalizes $\sigma$ to be 1. But, the logit model
has variance $\pi^2/3$. Therefore, the logit estimates tend to be larger than the probit estimates
although by a factor less than $\pi/\sqrt{3}$. In order to make the logit results comparable to those of
the probit, Amemiya (1981) suggests multiplying the logit coefficient estimates by 0.625.

Similarly,  to  make  the  linear  probability  estimates  comparable  to  those  of  the  probit  model 
one  needs  to  multiply  these  coefficients  by  2.5  and  then  subtract  1.25  from  the  constant  term. 
For this example, both logit and probit procedures converged quickly in 4 iterations. The
log-likelihood values and McFadden's (1974) $R^2$ obtained for the last iteration are recorded.
Table 13.3  Comparison of the Linear Probability, Logit and Probit Models: Union Participation*

Variable          OLS              Logit            Probit
EXP               -.005 (1.14)     -.007 (1.15)     -.007 (1.21)
WKS               -.045 (5.21)     -.068 (5.05)     -.061 (5.16)
OCC                .795 (6.85)     1.036 (6.27)      .955 (6.28)
IND                .075 (0.79)      .114 (0.89)      .093 (0.76)
SOUTH             -.425 (4.27)     -.653 (4.33)     -.593 (4.26)
SMSA               .211 (2.20)      .280 (2.05)      .261 (2.03)
MS                 .247 (1.55)      .378 (1.66)      .351 (1.62)
FEM               -.272 (1.37)     -.483 (1.58)     -.407 (1.47)
ED                -.040 (1.88)     -.057 (1.85)     -.057 (1.99)
BLK                .125 (0.71)      .222 (0.90)      .226 (0.99)
Const             1.740 (5.27)     2.738 (3.27)     2.517 (3.30)

Log-likelihood                     -312.337         -313.380
McFadden's R2                         0.201            0.198
Chi-squared (10)                      157.2            155.1

* Figures in parentheses are t-statistics


Note  that  the  logit  and  probit  estimates  yield  similar  results  in  magnitude,  sign  and  significance. 
One  would  expect  different  results  from  the  logit  and  probit  only  if  there  are  several  observations 
in  the  tails.  The  following  variables  were  insignificant  at  the  5%  level:  EXP,  IND,  MS,  FEM 
and  BLK.  The  results  show  that  union  participation  is  less  likely  if  the  individual  resides  in  the 
South  and  more  likely  if  he  or  she  resides  in  a  standard  metropolitan  statistical  area.  Union 
participation  is  also  less  likely  the  more  the  weeks  worked  and  the  higher  the  years  of  education. 
Union  participation  is  more  likely  for  blue-collar  than  non  blue-collar  occupations.  The  linear 
probability model yields different estimates from the logit and probit results. OLS predicts two
observations with $\hat y_i > 1$, and 29 observations with $\hat y_i < 0$. Table 13.5 gives the actual versus
predicted  values  of  union  participation  for  the  linear  probability,  logit  and  probit  models.  The 
percentage  of  correct  predictions  is  75%  for  the  linear  probability  and  probit  model  and  76% 
for  the  logit  model. 

One can test the significance of all slope coefficients by computing the LR based on the
unrestricted log-likelihood value ($\log\ell_u$) reported in Table 13.3, and the restricted log-likelihood
value including only the constant. The latter is the same for both the logit and probit models
and is given by

$$\log\ell_r = n[\bar y\log\bar y + (1 - \bar y)\log(1 - \bar y)] \qquad (13.33)$$

where $\bar y$ is the proportion of the sample with $y_i = 1$, see problem 2. In this example, $\bar y = 218/595 = 0.366$ and $n = 595$ with $\log\ell_r = -390.918$. Therefore, for the probit model,

$$LR = -2[\log\ell_r - \log\ell_u] = -2[-390.918 + 313.380] = 155.1$$

which is distributed as $\chi^2_{10}$ under the null of zero slope coefficients. This is highly significant
and the null is rejected. Similarly, for the logit model this LR statistic is 157.2. For the linear
probability model, the same null hypothesis of zero slope coefficients can be tested using a
Table 13.4  Probit Estimates: Union Participation

Dependent Variable:       UNION
Method:                   ML - Binary Probit
Sample:                   1 595
Included observations:    595
Convergence achieved after 5 iterations
Covariance matrix computed using second derivatives

Variable     Coefficient    Std. Error    z-Statistic    Prob.
EX           -0.006932      0.005745      -1.206491      0.2276
WKS          -0.060829      0.011785      -5.161666      0.0000
OCC           0.955490      0.152137       6.280476      0.0000
IND           0.092827      0.122774       0.756085      0.4496
SOUTH        -0.592739      0.139102      -4.261183      0.0000
SMSA          0.260700      0.128630       2.026741      0.0427
MS            0.350520      0.216284       1.620648      0.1051
FEM          -0.407026      0.277038      -1.469203      0.1418
ED           -0.057382      0.028842      -1.989515      0.0466
BLK           0.226482      0.228845       0.989675      0.3223
C             2.516784      0.762612       3.300217      0.0010

Mean dependent var        0.366387     S.D. dependent var        0.482222
S.E. of regression        0.420828     Akaike info criterion     1.090351
Sum squared resid         103.4242     Schwarz criterion         1.171484
Log likelihood           -313.3795     Hannan-Quinn criter.      1.121947
Restr. log likelihood    -390.9177     Avg. log likelihood      -0.526688
LR statistic (10 df)      155.0763     McFadden R-squared        0.198349
Probability (LR stat)     0.000000
Obs with Dep=0            377          Total obs                 595
Obs with Dep=1            218
Table 13.5  Actual Versus Predicted: Union Participation

                                    Predicted
                        Union = 0         Union = 1         Total
Actual   Union = 0      OLS    = 312      OLS    = 65       377
                        LOGIT  = 316      LOGIT  = 61
                        Probit = 314      Probit = 63
         Union = 1      OLS    = 83       OLS    = 135      218
                        LOGIT  = 82       LOGIT  = 136
                        Probit = 86       Probit = 132
Total                   OLS    = 395      OLS    = 200      595
                        LOGIT  = 398      LOGIT  = 197
                        Probit = 400      Probit = 195
Chow F-statistic. This yields an observed value of 17.80 which is distributed as F(10, 584)
under the null hypothesis. Again, the null is soundly rejected. This F-test is in fact the BRMR
test considered in section 13.6. As described in section 13.8, McFadden's $R^2$ is given by $R_5^2 = 1 - [\log\ell_u/\log\ell_r]$ which for the probit model yields

$$R_5^2 = 1 - (313.380/390.918) = 0.198.$$

For the logit model, McFadden's $R^2$ is 0.201.
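
The arithmetic behind these statistics is easy to reproduce. The short Python sketch below recomputes the restricted log-likelihood in (13.33), the LR statistics and McFadden's $R^2$ from the sample counts and the unrestricted log-likelihood values reported in Table 13.3. The variable names are illustrative only.

```python
import math

n = 595    # sample size
n1 = 218   # individuals whose wage is set by a union
ybar = n1 / n

# Restricted log-likelihood of the constant-only model, equation (13.33)
loglik_r = n * (ybar * math.log(ybar) + (1 - ybar) * math.log(1 - ybar))

# Unrestricted log-likelihood values reported in Table 13.3
loglik_u = {"logit": -312.337, "probit": -313.380}

results = {}
for model, ll in loglik_u.items():
    lr_stat = -2.0 * (loglik_r - ll)   # LR test of zero slope coefficients
    mcfadden = 1.0 - ll / loglik_r     # McFadden's R-squared
    results[model] = (lr_stat, mcfadden)
    print(model, round(lr_stat, 1), round(mcfadden, 3))
```

Both LR statistics would then be compared with a chi-squared critical value with 10 degrees of freedom, one for each slope coefficient.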


Example  2:  Employment  and  Problem  Drinking 

Mullahy and Sindelar (1996) estimate a linear probability model relating employment and measures of problem drinking. The analysis is based on the 1988 Alcohol Supplement of the National Health Interview Survey. This regression was performed for males and females separately since the authors argue that women are less likely than men to be alcoholic, are more likely to abstain from consumption, and have lower mean alcohol consumption levels. They also report that women metabolize ethanol faster than do men and experience greater liver damage for the same level of consumption of ethanol. The dependent variable takes the value 1 if the individual was employed in the past two weeks and zero otherwise.

The explanatory variables include a dummy variable, denoted by hvdrnk90, that takes the value 1 if the individual is a heavy drinker in the upper 90th percentile of ethanol consumption in the sample (18 oz. for males and 10.8 oz. for females) and zero otherwise. The other explanatory variables are: the state unemployment rate in 1988 (UE88), Age, Age$^2$, schooling, married, family size, and white; health status dummies indicating whether the individual's health was excellent, very good, fair; region of residence dummies indicating whether the individual resided in the northeast, midwest or south; and dummies for whether he or she resided in a center city (msa1) or other metropolitan statistical area (not center city, msa2). Three additional dummy variables were included for the quarters in which the survey was conducted. Details on the definitions of these variables are given in Table 1 of Mullahy and Sindelar (1996).

Table 13.6 gives the probit results based on n = 9822 males using Stata. These results show a negative relationship between the 90th percentile alcohol variable and the probability of being employed, but this has a p-value of 0.075. Mullahy and Sindelar find that for both men and women, problem drinking results in reduced employment and increased unemployment. Table 13.7 gives the marginal effects computed in Stata using the mfx option after probit estimation. The marginal effects are computed at the sample mean of the variables, except in the case of dummy variables where it is done for a discrete change from 0 to 1. For example, the marginal effect of being a heavy drinker in the upper 90th percentile of ethanol consumption in the sample (given that all the other variables are evaluated at their mean and dummy variables are changing from 0 to 1) is to decrease the probability of employment by about 1.6 percentage points. These can also be computed at particular values of the explanatory variables with the option at in Stata. In fact Table 13.8 gives the average marginal effect for all males. This can be computed using the margeff command in Stata. In this case the average marginal effect for a heavy drinker (-.0165, with standard error .0093) did not change much from the marginal effect computed at the sample mean (-.0162, with standard error .0096). The goodness of fit as measured by how well this probit classifies the predicted probabilities is given in Table 13.9 using the estat classification option in Stata. The percentage of correct predictions is 90.79%. Problem 13 asks the reader to verify these results as well as those in the original article by Mullahy and Sindelar (1996).
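
The marginal effects at the sample mean can be reproduced approximately from numbers that already appear in Tables 13.6 and 13.7. The Python sketch below is only an illustration: it recovers the probit index at the mean from the reported predicted probability (.92244871), multiplies the normal density by the coefficient for a continuous regressor, and takes a discrete difference of the normal distribution function for the dummy hvdrnk90. The variable names are illustrative and the results agree with Table 13.7 only up to rounding of the reported inputs.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution

# Predicted probability at the sample mean, reported in Table 13.7
p_at_mean = 0.92244871
z_bar = std_normal.inv_cdf(p_at_mean)  # implied index x-bar'beta-hat

# Probit coefficients from Table 13.6
beta = {"hvdrnk90": -0.1049465, "ue88": -0.0532774, "age": 0.0996338}

# Continuous regressor: marginal effect = phi(z_bar) * beta_k
me_ue88 = std_normal.pdf(z_bar) * beta["ue88"]
me_age = std_normal.pdf(z_bar) * beta["age"]

# Dummy regressor: discrete change from 0 to 1, other variables at their means
mean_hvdrnk90 = 0.099165                               # sample mean from Table 13.7
z_0 = z_bar - beta["hvdrnk90"] * mean_hvdrnk90         # index with hvdrnk90 = 0
z_1 = z_0 + beta["hvdrnk90"]                           # index with hvdrnk90 = 1
me_hvdrnk90 = std_normal.cdf(z_1) - std_normal.cdf(z_0)

print("ue88:", me_ue88)
print("age:", me_age)
print("hvdrnk90:", me_hvdrnk90)
```

The average marginal effect of Table 13.8 would instead require evaluating these expressions at every observation and averaging, which needs the full data set.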
Table 13.6  Probit Estimates: Employment and Problem Drinking

. probit emp hvdrnk90 ue88 age agesq educ married famsize white hlstat1 hlstat2 hlstat3 hlstat4
  region1 region2 region3 msa1 msa2 q1 q2 q3, robust

Probit regression                               Number of obs  =    9822
                                                Wald chi2(20)  =  928.33
                                                Prob > chi2    =  0.0000
Log pseudolikelihood = -2698.1797               Pseudo R2      =  0.1651

                          Robust
emp          Coef.        Std. Err.      z       P>|z|     [95% Conf. Interval]
hvdrnk90     -.1049465    .0589881     -1.78     0.075     -.2205612     .0106681
ue88         -.0532774    .0142025     -3.75     0.000     -.0811137    -.0254411
age           .0996338    .0171185      5.82     0.000      .0660821     .1331855
agesq        -.0013043    .0002051     -6.36     0.000     -.0017062    -.0009023
educ          .0471834    .0066739      7.07     0.000      .0341029     .0602639
married       .2952921    .0540858      5.46     0.000      .189286      .4012982
famsize       .0188906    .0140463      1.34     0.179     -.0086398     .0464209
white         .3945226    .0483381      8.16     0.000      .2997818     .4892634
hlstat1      1.816306     .0983447     18.47     0.000     1.623554     2.009058
hlstat2      1.778434     .0991531     17.94     0.000     1.584098     1.972771
hlstat3      1.547836     .0982637     15.75     0.000     1.355243     1.74043
hlstat4      1.043363     .1077279      9.69     0.000      .8322205    1.254506
region1       .0343123    .0620021      0.55     0.580     -.0872096     .1558341
region2       .0604907    .0537885      1.12     0.261     -.0449327     .1659142
region3       .1821206    .0542346      3.36     0.001      .0758227     .2884185
msa1         -.0730529    .0518719     -1.41     0.159     -.1747199     .0286141
msa2          .0759533    .0513092      1.48     0.139     -.0246109     .1765175
q1           -.1054844    .0527728     -2.00     0.046     -.2089171    -.0020516
q2           -.0513229    .0528185     -0.97     0.331     -.1548453     .0521995
q3           -.0293419    .0543751     -0.54     0.589     -.1359152     .0772313
_cons       -3.017454     .3592321     -8.40     0.000    -3.721536    -2.313372

Example  3:  Fertility  and  Same  Sex  of  Previous  Children 

Carrasco (2001) estimated a probit equation for fertility using PSID data over the period 1986-1989. The sample consists of 1,442 married or cohabiting women between the ages of 18 and 55 in 1986. The dependent variable fertility (f) is specified by a dummy variable that equals 1 if the age of the youngest child in the next year is 1. The explanatory variables are: (ags261) which is a dummy variable that equals 1 if the woman has a child between 2 and 6 years old; education which has three levels (educ_1, educ_2 and educ_3); the female's age, race, and husband's income; and an indicator of same sex of previous children (dsex), with its components (dsexf) for girls and (dsexm) for boys. This variable exploits the widely observed phenomenon of parental preferences for a mixed sibling-sex composition in developed countries. Therefore, a dummy for whether the sex of the next child matches the sex of the previous children provides a plausible predictor for additional childbearing. The data set can be obtained from the Journal of Business & Economic Statistics archive data web site. Problem 15 asks the reader to replicate some of the results obtained in the original article by Carrasco (2001). The estimates reveal that having children of the same sex has a significant and positive effect on the probability of having an additional child. The marginal effect of same sex children is to increase the probability of fertility by about 3 percentage points, see Table 13.10. These are obtained using the dprobit command in Stata.
Table 13.7  Marginal Effects: Employment and Problem Drinking

. mfx compute

Marginal effects after probit
      y  = Pr(emp) (predict)
         = .92244871

variable      dy/dx        Std. Err.     z       P>|z|     [95% Conf. Interval]        X
hvdrnk90*    -.0161704     .00962      -1.68     0.093     -.035034     .002693      .099165
ue88         -.0077362     .00205      -3.78     0.000     -.011747    -.003725     5.56921
age           .0144674     .00248       5.83     0.000      .009607     .019327    39.1757
agesq        -.0001894     .00003      -6.37     0.000     -.000248    -.000131  1627.61
educ          .0068513     .00096       7.12     0.000      .004966     .008737    13.3096
married*      .0488911     .01009       4.85     0.000      .029119     .068663      .816432
famsize       .002743      .00204       1.35     0.179     -.001253     .006739     2.7415
white*        .069445      .01007       6.90     0.000      .049709     .089181      .853085
hlstat1*      .2460794     .01484      16.58     0.000      .216991     .275167      .415903
hlstat2*      .1842432     .00992      18.57     0.000      .164799     .203687      .301873
hlstat3*      .130786      .00661      19.80     0.000      .11784      .143732      .205254
hlstat4*      .0779836     .00415      18.77     0.000      .069841     .086126      .053451
region1*      .0049107     .00875       0.56     0.575     -.012233     .022054      .203014
region2*      .0086088     .0075        1.15     0.251     -.006092     .023309      .265628
region3*      .0252543     .00715       3.53     0.000      .011247     .039262      .318265
msa1*        -.0107946     .00779      -1.39     0.166     -.026061     .004471      .333232
msa2*         .0109542     .00735       1.49     0.136     -.003456     .025365      .434942
q1*          -.0158927     .00825      -1.93     0.054     -.032053     .000268      .254632
q2*          -.0075883     .00795      -0.95     0.340     -.023167     .007991      .252698
q3*          -.0043066     .00807      -0.53     0.594     -.020121     .011508      .242822

(*) dy/dx is for discrete change of dummy variable from 0 to 1
In  many  economic  situations,  the  choice  may  be  among  m  alternatives  where  m  >  2.  These  may
be  unordered  alternatives  like  the  selection  of  a  mode  of  transportation,  bus,  car  or  train,  or 
an  occupational  choice  like  lawyer,  carpenter,  teacher,  etc.,  or  they  may  be  ordered  alternatives 
like  bond  ratings,  or  the  response  to  an  opinion  survey,  which  could  vary  from  strongly  agree  to 
strongly  disagree.  Ordered  response  multinomial  models  utilize  the  extra  information  implicit  in 
the  ordinal  nature  of  the  dependent  variable.  Therefore,  these  models  have  a  different  likelihood 
than  unordered  response  multinomial  models  and  have  to  be  treated  separately. 


13.10.1  Ordered  Response  Models 

Suppose there are three bond ratings, A, AA and AAA. We sample n bonds and the i-th bond
is rated A (which we record as $y_i = 0$) if its performance index $I_i^* < 0$, where 0 is again not
restrictive. $I_i^* = x_i'\beta + u_i$, so the probability of an A rating or the $\Pr[y_i = 0]$ is

$$\pi_{1i} = \Pr[y_i = 0] = P[I_i^* < 0] = P[u_i < -x_i'\beta] = F(-x_i'\beta) \qquad (13.34)$$
Table 13.8  Average Marginal Effects: Employment and Problem Drinking

. margeff

Average partial effects after probit
      y  = Pr(emp)

variable      Coef.        Std. Err.      z       P>|z|     [95% Conf. Interval]
hvdrnk90     -.0164971     .009264      -1.78     0.075     -.0346543     .00166
ue88         -.0078854     .0019748     -3.99     0.000     -.011756     -.0040149
age           .0147633     .0024012      6.15     0.000      .010057      .0194697
agesq        -.000193      .0000287     -6.73     0.000     -.0002493    -.0001368
educ          .0069852     .0009316      7.50     0.000      .0051593     .0088112
married       .048454      .0070149      6.91     0.000      .0347051     .0622028
famsize       .002796      .0019603      1.43     0.154     -.0010461     .0066382
white         .0685255     .0062822     10.91     0.000      .0562127     .0808383
hlstat1       .2849987     .0059359     48.01     0.000      .2733645     .2966328
hlstat2       .2318828     .0049776     46.59     0.000      .2221269     .2416386
hlstat3       .1725703     .0049899     34.58     0.000      .1627903     .1823502
hlstat4       .0914458     .0048387     18.90     0.000      .0819621     .1009295
region1       .0050178     .0083778      0.60     0.549     -.0114025     .021438
region2       .0088116     .0071262      1.24     0.216     -.0051556     .0227787
region3       .0259534     .0064999      3.99     0.000      .0132139     .0386929
msa1         -.0109515     .007632      -1.43     0.151     -.02591       .0040071
msa2          .0111628     .0067952      1.64     0.100     -.0021556     .0244811
q1           -.0160925     .0080458     -2.00     0.045     -.0318619    -.0003231
q2           -.0077086     .0076973     -1.00     0.317     -.0227951     .0073779
q3           -.0043814     .0077835     -0.56     0.573     -.0196368     .010874

Table 13.9  Actual vs Predicted: Employment and Problem Drinking

. estat class

Probit model for emp

                      True
Classified        D         ~D       Total
    +           8743       826        9569
    -             79       174         253
Total           8822      1000        9822

Classified + if predicted Pr(D) >= .5
True D defined as emp != 0

Sensitivity                        Pr( +| D)     99.10%
Specificity                        Pr( -|~D)     17.40%
Positive predictive value          Pr( D| +)     91.37%
Negative predictive value          Pr(~D| -)     68.77%

False + rate for true ~D           Pr( +|~D)     82.60%
False - rate for true D            Pr( -| D)      0.90%
False + rate for classified +      Pr(~D| +)      8.63%
False - rate for classified -      Pr( D| -)     31.23%

Correctly classified                             90.79%
Table 13.10  Marginal Effects: Fertility and Same Sex of Previous Children

. dprobit f dsex ags261 educ_2 educ_3 age drace inc

Probit regression, reporting marginal effects     Number of obs  =    5768
                                                  LR chi2(7)     =  964.31
                                                  Prob > chi2    =  0.0000
Log likelihood = -1561.1312                       Pseudo R2      =  0.2360

f           dF/dx        Std. Err.      z        P>|z|     x-bar       [95% Conf. Interval]
dsex*        .0302835    .0069532      5.40     0.000      .256415     .016655     .043912
ags261*     -.1618148    .0066629    -13.22     0.000      .377601    -.174874    -.148756
educ_2*      .0022157    .0090239      0.24     0.808      .717753    -.015471     .019902
educ_3*      .0288636    .0140083      2.45     0.014      .223994     .001408     .056319
age         -.0065031    .0007644    -16.65     0.000    32.8024      -.008001    -.005005
drace*      -.0077119    .0055649     -1.45     0.146      .773232    -.018619     .003195
inc          .0002542    .000241       1.06     0.289    12.8582      -.000218     .000727

obs. P       .1137309
pred. P      .0367557   (at x-bar)

(*) dF/dx is for discrete change of dummy variable from 0 to 1
z and P>|z| correspond to the test of the underlying coefficient being 0


The i-th bond is rated AA (which we record as $y_i = 1$) if its performance index $I_i^*$ is between
0 and c where c is a positive number, with probability

$$\pi_{2i} = \Pr[y_i = 1] = P[0 < I_i^* < c] = P[0 < x_i'\beta + u_i < c] = F(c - x_i'\beta) - F(-x_i'\beta) \qquad (13.35)$$

The i-th bond is rated AAA (which we record as $y_i = 2$) if $I_i^* > c$, with probability

$$\pi_{3i} = \Pr[y_i = 2] = P[I_i^* > c] = P[x_i'\beta + u_i > c] = 1 - F(c - x_i'\beta) \qquad (13.36)$$

F can be the logit or probit function. The log-likelihood function for the ordered probit is given
by

$$\log\ell(\beta, c) = \sum_{y_i=0}\log\Phi(-x_i'\beta) + \sum_{y_i=1}\log[\Phi(c - x_i'\beta) - \Phi(-x_i'\beta)] + \sum_{y_i=2}\log[1 - \Phi(c - x_i'\beta)] \qquad (13.37)$$

For the probabilities given in (13.34), (13.35) and (13.36), the marginal effects of changes in the
regressors are:

$$\partial\pi_{1i}/\partial x_i = -f(-x_i'\beta)\beta \qquad (13.38)$$
$$\partial\pi_{2i}/\partial x_i = [f(-x_i'\beta) - f(c - x_i'\beta)]\beta \qquad (13.39)$$
$$\partial\pi_{3i}/\partial x_i = f(c - x_i'\beta)\beta \qquad (13.40)$$

Generalizing this model to m bond ratings is straightforward. The likelihood, the score and
the Hessian for the m-ordered probit model are given in Maddala (1983, pp. 47-49).
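
A small numerical sketch may help fix ideas about the three-category ordered probit. The Python code below computes the three probabilities and the marginal effects in (13.38) to (13.40) for one observation. The regressor values, the coefficients and the threshold c are hypothetical, and the function names are illustrative. Two properties are worth noting: the probabilities sum to one, and the marginal effects of any regressor sum to zero across the three categories.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def ordered_probit_probs(x, beta, c):
    """Probabilities of the three ordered categories, as in (13.34)-(13.36)."""
    xb = sum(b * xi for b, xi in zip(beta, x))
    p1 = norm_cdf(-xb)
    p2 = norm_cdf(c - xb) - norm_cdf(-xb)
    p3 = 1.0 - norm_cdf(c - xb)
    return [p1, p2, p3]

def ordered_probit_marginal_effects(x, beta, c):
    """Derivatives of the three probabilities with respect to x, (13.38)-(13.40)."""
    xb = sum(b * xi for b, xi in zip(beta, x))
    d1 = [-norm_pdf(-xb) * b for b in beta]
    d2 = [(norm_pdf(-xb) - norm_pdf(c - xb)) * b for b in beta]
    d3 = [norm_pdf(c - xb) * b for b in beta]
    return [d1, d2, d3]

# Hypothetical values: a constant and one regressor, threshold c > 0
x_i = [1.0, 0.5]
beta_ex = [0.2, 0.6]
c_ex = 1.2

probs = ordered_probit_probs(x_i, beta_ex, c_ex)
effects = ordered_probit_marginal_effects(x_i, beta_ex, c_ex)
print("probabilities:", probs)
print("marginal effects:", effects)
```

Only the signs of the effects on the lowest and highest categories are determined by the sign of the coefficient. The effect on the middle category depends on the difference between the two densities.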
Illustrative Example: Corporate Bond Rating. This data set is obtained from Baum (2006) by
issuing the following command in Stata:

. use http://www.stata-press.com/data/imeus/panel84extract, clear

This data set contains ratings of 98 corporate bonds coded as 2 to 5 (rating83c). The rating 2
corresponds to the lowest rating BA_B_C and 5 to the highest rating AAA. These are given in
Table 13.11.

Table 13.11  Corporate Bond Rating

. tab rating83c

Bond rating, 1982     Freq.     Percent       Cum.
BA_B_C                  26       26.53       26.53
BAA                     28       28.57       55.10
AA_A                    15       15.31       70.41
AAA                     29       29.59      100.00
Total                   98      100.00

This  is  modeled  as  an  ordered  logit  with  two  explanatory  variables:  ia83,  the  income  to  asset 
ratio  in  1983,  and  the  change  in  that  ratio  from  1982  to  1983  ( dia ).  The  summary  statistics  are 
given  in  Table  13.12. 

Table 13.12  Ordered Logit

Variable      Obs     Mean         Std. Dev.     Min           Max
rating83c      98     3.479592     1.17736        2             5
ia83           98    10.11473      7.441946     -13.08016     30.74564
dia            98      .7075242    4.711211     -10.79014     20.05367

. ologit rating83c ia83 dia

Ordered logistic regression                     Number of obs  =      98
                                                LR chi2(2)     =   11.54
                                                Prob > chi2    =  0.0021
Log likelihood = -127.27146                     Pseudo R2      =  0.0434

rating83c     Coef.        Std. Err.      z       P>|z|     [95% Conf. Interval]
ia83           .0939166    .0296196      3.17     0.002      .0358633     .1519699
dia           -.0866925    .0449789     -1.93     0.054     -.1748496     .0014646

/cut1         -.1853053    .3571432                         -.8852932     .5146826
/cut2         1.185726     .3882099                          .4248488    1.946604
/cut3         1.908412     .4164896                         1.092108     2.724717

Income/assets has a positive effect on the rating, while the change in that ratio has a negative
effect! The first is significant at the 5% level, while the second is significant only at the 10% level (p-value of 0.054).
Table 13.13  Predicted Bond Rating

. predict pbabc pbaa paa paaa, pr
. sum pbabc pbaa paa paaa, separator(0)

Variable     Obs     Mean         Std. Dev.     Min          Max
pbabc         98     .2729981     .1224448      .0388714     .7158453
pbaa          98     .2950074     .0456984      .0985567     .3299373
paa           98     .1496219     .0274841      .0449056     .1787291
paaa          98     .2823726     .1304381      .0466343     .7528986

The /cut1 to /cut3 estimates give the estimated thresholds of the ratings categories using a logit specification. The first one is not significant, while the other two are. Note that the 95% confidence
intervals for these thresholds overlap.

We can predict the probability of achieving each rating using the predict command, naming
the variables that will hold the predicted probabilities for each category, see Table 13.13. The averages of these predictions
are pretty close to the actual frequencies observed in each category.
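
To see how such predicted probabilities are built from the estimates in Table 13.12, the following Python sketch evaluates the four category probabilities at a chosen pair of regressor values. It assumes the cut-point parameterization used by Stata's ologit, where the probability of category j is the logistic distribution function evaluated at cut j minus the index, less the same quantity at cut j-1, and where there is no constant term. The function names are illustrative. Evaluating at the sample means gives probabilities for an average bond, which need not equal the sample averages of the individual predictions in Table 13.13.

```python
import math

def logistic_cdf(z):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

# Estimates from Table 13.12
beta_ia83 = 0.0939166
beta_dia = -0.0866925
cuts = [-0.1853053, 1.185726, 1.908412]

def ologit_probs(ia83, dia):
    """Probabilities of the four ordered ratings: BA_B_C, BAA, AA_A, AAA."""
    xb = beta_ia83 * ia83 + beta_dia * dia
    # Cumulative probabilities Pr(rating <= j); the last one equals 1
    cum = [logistic_cdf(k - xb) for k in cuts] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Evaluate at the sample means reported in Table 13.12
probs_at_mean = ologit_probs(10.11473, 0.7075242)
print("at the means:", probs_at_mean)

# A higher income to asset ratio shifts probability toward the top rating
probs_high_ia83 = ologit_probs(20.0, 0.7075242)
print("ia83 = 20:", probs_high_ia83)
```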

13.10.2  Unordered  Response  Models 

There are m choices each with probability $\pi_{i1}, \pi_{i2}, \ldots, \pi_{im}$ for individual i. $y_{ij} = 1$ if individual
i chooses alternative j, otherwise it is 0. This means that $\sum_{j=1}^m y_{ij} = 1$ and $\sum_{j=1}^m \pi_{ij} = 1$. The
likelihood function for n individuals is a multinomial given by:

$$\ell = \prod_{i=1}^n (\pi_{i1})^{y_{i1}}(\pi_{i2})^{y_{i2}}\cdots(\pi_{im})^{y_{im}} \qquad (13.41)$$

This model can be motivated by a utility maximization story where the utility that individual
i derives from say the occupational choice j is denoted by $U_{ij}$ and is a function of the job
attributes for the i-th individual, i.e., some $x_{ij}$'s like the present value of potential earnings,
and training cost/net worth for that job choice for individual i, see Boskin (1974).

$$U_{ij} = x_{ij}'\beta + \epsilon_{ij} \qquad (13.42)$$

where $\beta$ is a vector of implicit prices for these occupational characteristics. Therefore, the
probability of choosing the first occupation is given by:

$$\pi_{i1} = \Pr[U_{i1} > U_{i2}, U_{i1} > U_{i3}, \ldots, U_{i1} > U_{im}] \qquad (13.43)$$
$$= \Pr[\epsilon_{i2} - \epsilon_{i1} < (x_{i1}' - x_{i2}')\beta,\ \epsilon_{i3} - \epsilon_{i1} < (x_{i1}' - x_{i3}')\beta,\ \ldots,\ \epsilon_{im} - \epsilon_{i1} < (x_{i1}' - x_{im}')\beta]$$

The normality assumption involves a number of integrals but has the advantage of not necessarily assuming the $\epsilon$'s to be independent. The more popular assumption computationally is
the multinomial logit model. This arises if and only if the $\epsilon$'s are independent and identically
distributed with what McFadden (1974) calls a Weibull distribution, also known as the type I extreme value distribution. Its distribution function is given by $F(z) = \exp(-\exp(-z))$. The difference between any two random variables with this distribution
has a logistic distribution $\Lambda(z) = e^z/(1 + e^z)$, giving the conditional logit model:

$$\pi_{ij} = \Pr[y_i = j] = \exp[(x_{ij} - x_{im})'\beta]\Big/\Big\{1 + \sum_{j=1}^{m-1}\exp[(x_{ij} - x_{im})'\beta]\Big\}$$
$$= \exp[x_{ij}'\beta]\Big/\sum_{j=1}^m \exp[x_{ij}'\beta] \quad\text{for } j = 1, 2, \ldots, m - 1 \qquad (13.44)$$
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and  7 Tirn  =  Pr [yi  =  m\  =  1/{1  +  YYjLi  exvi(xij  ~  xim)' Wi  =  exp[x'm/3]/ Yl%iex:P[xijP\-  There 
are  two  consequences  of  this  conditional  logit  specification.  The  first  is  that  the  odds  of  any  two 
alternative  occupations,  say  1  and  2  is  given  by 

7m/7Ti2  =  exp[(xii  -  xi2)'/3] 

and  this  does  not  change  when  the  number  of  alternatives  change  from  m  to  m*,  since  the 
denominators  divide  out.  Therefore,  the  odds  are  unaffected  by  an  additional  alternative.  This 
is  known  as  the  independence  of  irrelevant  alternatives  property  and  can  represent  a  serious 
weakness  in  the  conditional  logit  model.  For  example,  suppose  the  choices  are  between  a  pony 
and  a  bicycle,  and  children  choose  a  pony  two-thirds  of  the  time.  Suppose  that  an  additional 
alternative  is  made  available,  an  additional  bicycle  but  of  a  different  color,  then  one  would  still 
expect  two-thirds  of  the  children  to  choose  the  pony  and  the  remaining  one-third  to  split  choices 
among  the  bicycles  according  to  their  color  preference.  In  the  conditional  logit  model,  however, 
the  proportion  choosing  the  pony  must  fall  to  one  half  if  the  odds  relative  to  either  bicycle  is  to 
remain  two  to  one  in  favor  of  the  pony.  This  illustrates  the  point  that  when  two  or  more  of  the 
m  alternatives  are  close  substitutes,  the  conditional  logit  model  may  not  produce  reasonable 
results. This feature is a consequence of assuming that the errors $\epsilon_{ij}$ are independent. Hausman and
McFadden (1984) proposed a Hausman type test to check for the independence of these errors.
They suggest that if a subset of the choices is truly irrelevant then omitting it from the model
altogether will not change the parameter estimates systematically. Including them if they are
irrelevant preserves consistency but is inefficient. The test statistic is

$q = (\hat{\beta}_s - \hat{\beta}_f)'[\hat{V}_s - \hat{V}_f]^{-1}(\hat{\beta}_s - \hat{\beta}_f)$  (13.45)

where s indicates the estimators based on the restricted subset and f denotes the estimator
based on the full set of choices. This is asymptotically distributed as $\chi_k^2$ where k is the dimension
of $\beta$.
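
The pony and bicycle arithmetic above can be checked numerically. The following is a minimal sketch, not part of the original text: the function name `conditional_logit_probs` and the utility values are hypothetical, chosen only so that the pony is preferred to a bicycle with odds of two to one. Adding a second bicycle with the same utility leaves each pairwise odds ratio at two, which forces the pony's share down from two-thirds to one half.

```python
import math

def conditional_logit_probs(utilities):
    """Return exp(v_j) / sum_h exp(v_h) for one individual's alternatives."""
    top = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical utilities: odds of pony versus bicycle are exp(log 2 - 0) = 2.
v_pony, v_bike = math.log(2.0), 0.0

two_choices = conditional_logit_probs([v_pony, v_bike])
three_choices = conditional_logit_probs([v_pony, v_bike, v_bike])

print("pony, bicycle:", two_choices)
print("pony, bicycle, second bicycle:", three_choices)
print("odds pony/bicycle:", two_choices[0] / two_choices[1],
      three_choices[0] / three_choices[1])
```

The odds ratio is the same in both choice sets, which is exactly the independence of irrelevant alternatives property.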

Second, in this specification, none of the $x_{ij}$'s can be constant across different alternatives,
because the corresponding $\beta$ will not be identified. This means that we cannot include individual
specific variables that do not vary across alternatives like race, sex, age, experience, income,
etc. The latter type of data is more frequent in economics, see Schmidt and Strauss (1975). In
this case the specification can be modified to allow for a differential impact of the explanatory
variables upon the odds of choosing one alternative rather than the other:

$\pi_{ij} = \Pr[y_i = j] = \exp(x_{ij}'\beta_j)/\sum_{h=1}^{m} \exp(x_{ih}'\beta_h)$  for j = 1, ..., m  (13.46)

where now the parameter vector is indexed by j. If the $x_{ij}$'s are the same for every j, then

$\pi_{ij} = \Pr[y_i = j] = \exp(x_i'\beta_j)/\sum_{h=1}^{m} \exp(x_i'\beta_h)$  for j = 1, ..., m  (13.47)

This is the model used by Schmidt and Strauss (1975). A normalization would be to take
$\beta_m = 0$, in which case, we get the multinomial logit model

$\pi_{im} = 1/[1 + \sum_{h=1}^{m-1} \exp(x_i'\beta_h)]$  (13.48)

and

$\pi_{ij} = \exp(x_i'\beta_j)/[1 + \sum_{h=1}^{m-1} \exp(x_i'\beta_h)]$  for j = 1, 2, ..., m - 1.  (13.49)

The likelihood function, score equations, Hessian and information matrices are given in Maddala
(1983, pp. 36-37).
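
As an illustration of (13.48) and (13.49), the sketch below computes multinomial logit probabilities under the normalization that the last alternative has a zero coefficient vector. It is only a sketch: the function name `mlogit_probs`, the regressor values and the coefficients are hypothetical and are not taken from any table in this chapter.

```python
import math

def mlogit_probs(x, betas):
    """Multinomial logit probabilities for one individual.

    x     : list of regressors (same for every alternative)
    betas : list of m-1 coefficient lists; alternative m is the base
            category with its coefficient vector normalized to zero.
    Returns [pi_1, ..., pi_{m-1}, pi_m].
    """
    scores = [sum(b_k * x_k for b_k, x_k in zip(b, x)) for b in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)   # the 1 is exp(x'0)
    probs = [math.exp(s) / denom for s in scores]    # (13.49)
    probs.append(1.0 / denom)                        # (13.48)
    return probs

# Hypothetical individual: a constant and one regressor, three alternatives.
x_i = [1.0, 0.5]
beta_hypothetical = [[0.2, -1.0], [-0.3, 0.8]]

pi = mlogit_probs(x_i, beta_hypothetical)
print("probabilities:", pi)
print("log odds of alternative 1 versus the base:", math.log(pi[0] / pi[-1]))
```

The log odds of alternative j relative to the base category equal $x_i'\beta_j$, which is how the coefficients of a multinomial logit are usually read.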

Multinomial  Logit  Model.  Terza  (2002)  reconsidered  the  Mullahy  and  Sindelar  (1996)  data 
set  for  problem  drinking  described  in  section  13.9.  However,  Terza  reclassified  the  dependent 
variable as follows: y=1 when the individual is out of the labor force, y=2 when this individual
is  unemployed,  and  y=3  when  this  individual  is  employed.  Table  13.14  runs  a  multinomial 
logit  model  which  replicates  some  of  the  results  in  Terza  (2002,  p.  399)  for  males  using  Stata. 
Although  the  health  variables  are  still  significant  the  problem  drinking  variable  is  not  significant. 
Problem  16  asks  the  reader  to  replicate  these  results  for  females. 


13.11  The  Censored  Regression  Model 

Suppose one is interested in the amount one is willing to spend on the purchase of a durable
good, for example, a car. In this case, one would observe the expenditures only if the car is
bought, so

$y_i^* = x_i'\beta + u_i$  if $y_i^* > 0$  (13.50)

where $x_i$ denotes a vector of household characteristics, such as income, number of children or
education, and $y_i^*$ is a latent variable, in this case the amount one is willing to spend on a car. We
observe $y_i = y_i^*$ only if $y_i^* > 0$ and we set $y_i = 0$ if $y_i^* \le 0$. The censoring at zero is of course
arbitrary, and the $u_i$'s are assumed to be IIN$(0, \sigma^2)$. This is known as the Tobit model after
Tobin (1958). In this case, we have censored observations since we do not observe any $y_i^*$ that is
negative. All we observe is the fact that this household did not buy a car and a corresponding
vector $x_i$ of this household's characteristics. Without loss of generality, we assume that the first
$n_1$ observations have positive $y_i^*$'s and the remaining $n_0 = n - n_1$ observations have non-positive
$y_i^*$'s. In this case, OLS on the first $n_1$ observations, i.e., using only the positive observed $y_i^*$'s,
would be biased since $u_i$ does not have zero mean in this subsample. In fact, by omitting observations for which
$y_i^* \le 0$ from the sample, one is only considering disturbances from (13.50) such that $u_i > -x_i'\beta$.
The distribution of these $u_i$'s is a truncated normal density given in Figure 13.2. The mean of
this density is not zero and is dependent on $\beta$, $\sigma^2$ and $x_i$. More formally, the regression function
can be written as:

$E(y_i^*/x_i, y_i^* > 0) = x_i'\beta + E[u_i/y_i^* > 0] = x_i'\beta + E[u_i/u_i > -x_i'\beta]$
$\qquad = x_i'\beta + \sigma\gamma_i$  for i = 1, 2, ..., $n_1$  (13.51)

where $\gamma_i = \phi(-z_i)/[1 - \Phi(-z_i)]$ and $z_i = x_i'\beta/\sigma$. See Greene (1993, p. 685) for the moments
of a truncated normal density or the Appendix to this chapter. OLS on the positive $y_i^*$'s omits
the second term in (13.51), and is therefore biased and inconsistent.
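
The size of the omitted term in (13.51) can be seen in a small simulation. This is a hedged sketch rather than anything from the original text: the parameter values, the seed and the names `inverse_mills` and `xb` are all hypothetical. It draws latent values for a single value of $x_i'\beta$, keeps only the positive ones, and compares their average with $x_i'\beta + \sigma\gamma_i$, using $\gamma_i = \phi(-z_i)/[1 - \Phi(-z_i)] = \phi(z_i)/\Phi(z_i)$.

```python
import math
import random

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inverse_mills(z):
    """gamma_i in (13.51): phi(z)/Phi(z), the same as phi(-z)/[1 - Phi(-z)]."""
    return norm_pdf(z) / norm_cdf(z)

random.seed(12345)          # fixed seed so the sketch is reproducible
xb, sigma = 0.5, 2.0        # hypothetical x_i'beta and sigma

latent = [xb + random.gauss(0.0, sigma) for _ in range(200000)]
positive = [y for y in latent if y > 0]      # the observations OLS would use

sim_mean = sum(positive) / len(positive)
theory = xb + sigma * inverse_mills(xb / sigma)

print("x'beta:", xb)
print("simulated mean of the positive observations:", sim_mean)
print("x'beta + sigma*gamma from (13.51):", theory)
```

The simulated mean sits well above $x_i'\beta$, which is the bias that OLS on the positive observations ignores.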

A simple two-step procedure can be used to estimate (13.51). First, we define a dummy variable $d_i$
which takes the value 1 if $y_i^*$ is observed and 0 otherwise. This allows us to perform probit
estimation on the whole sample, and provides us with a consistent estimator of $(\beta/\sigma)$. Also,
$P[d_i = 1] = P[y_i^* > 0] = P[u_i > -x_i'\beta]$ and $P[d_i = 0] = P[y_i^* \le 0] = P[u_i \le -x_i'\beta]$. Therefore,
the likelihood function is given by

$\ell = \prod_{i=1}^{n} \Phi(z_i)^{d_i}[1 - \Phi(z_i)]^{1-d_i}$  where $z_i = x_i'\beta/\sigma$  (13.52)
Table 13.14 Multinomial Logit Results: Problem Drinking

. mlogit y alc90th ue88 age agesq schooling married famsize white excellent verygood good fair
    northeast midwest south centercity othermsa q1 q2 q3, baseoutcome(1)

Multinomial logistic regression                 Number of obs  =      9822
                                                Wald chi2(20)  =   1276.47
                                                Prob > chi2    =    0.0000
Log likelihood = -3217.481                      Pseudo R2      =    0.1655

y            |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
2            |
alc90th      |   .1270931     .21395     0.59   0.552    -.2922412   .5464274
ue88         |   .0458099    .051355     0.89   0.372    -.0548441   .1464639
age          |   .1617634   .0663205     2.44   0.015     .0317776   .2917492
agesq        |  -.0024377   .0007991    -3.05   0.002     -.004004  -.0008714
schooling    |  -.0092135   .0245172    -0.38   0.707    -.0572664   .0388393
married      |   .4004928   .1927458     2.08   0.038      .022718   .7782677
famsize      |   .0622453   .0503686     1.24   0.217    -.0364753   .1609659
white        |   .0391309   .1705625     0.23   0.819    -.2951653   .3734272
excellent    |    2.91833   .4486757     6.50   0.000     2.038942   3.797719
verygood     |   2.978336   .4505932     6.61   0.000      2.09519   3.861483
good         |   2.493939   .4446815     5.61   0.000     1.622379   3.365499
fair         |   1.460263   .4817231     3.03   0.002     .5161027   2.404422
northeast    |   .0849125   .2374365     0.36   0.721    -.3804545   .5502796
midwest      |   .0158816   .2037486     0.08   0.938    -.3834583   .4152215
south        |   .1750244   .2027444     0.86   0.388    -.2223474   .5723962
centercity   |  -.2717445   .1911074    -1.42   0.155    -.6463081   .1028192
othermsa     |  -.0921566   .1929076    -0.48   0.633    -.4702486   .2859354
q1           |    .422405   .1978767     2.13   0.033     .0345738   .8102362
q2           |  -.0219499   .2056751    -0.11   0.915    -.4250657   .3811659
q3           |  -.0365295   .2109049    -0.17   0.862    -.4498954   .3768364
_cons        |  -6.113244   1.427325    -4.28   0.000    -8.910749  -3.315739
-------------+----------------------------------------------------------------
3            |
alc90th      |  -.1534987   .1395003    -1.10   0.271    -.4269144   .1199169
ue88         |  -.0954848    .033631    -2.84   0.005    -.1614004  -.0295693
age          |    .227164   .0409884     5.54   0.000     .1468282   .3074999
agesq        |  -.0030796   .0004813    -6.40   0.000    -.0040228  -.0021363
schooling    |   .0890537   .0152314     5.85   0.000     .0592008   .1189067
married      |   .7085708   .1219565     5.81   0.000     .4695405   .9476012
famsize      |   .0622447   .0332365     1.87   0.061    -.0028975    .127387
white        |   .7380044   .1083131     6.81   0.000     .5257147   .9502941
excellent    |   3.702792   .1852415    19.99   0.000     3.339725   4.065858
verygood     |   3.653313   .1894137    19.29   0.000     3.282069   4.024557
good         |    2.99946   .1786747    16.79   0.000     2.649264   3.349656
fair         |   1.876172   .1885159     9.95   0.000     1.506688   2.245657
northeast    |    .088966   .1491191     0.60   0.551     -.203302   .3812341
midwest      |   .1230169   .1294376     0.95   0.342     -.130676   .3767099
south        |   .4393047   .1298054     3.38   0.001     .1848908   .6937185
centercity   |  -.2689532   .1231083    -2.18   0.029     -.510241  -.0276654
othermsa     |   .0978701   .1257623     0.78   0.436    -.1486195   .3443598
q1           |  -.0274086   .1286695    -0.21   0.831    -.2795961    .224779
q2           |   -.110751    .126176    -0.88   0.380    -.3580514   .1365494
q3           |  -.0530835   .1296053    -0.41   0.682    -.3071052   .2009382
_cons        |  -6.237275   .8886698    -7.02   0.000    -7.979036  -4.495515
-------------------------------------------------------------------------------
(y==1 is the base outcome)
Figure 13.2 Truncated Normal Distribution

and once $\beta/\sigma$ is estimated, we substitute these estimates in $z_i$ and $\gamma_i$ given below (13.51) to get
$\hat{\gamma}_i$. The second step is to estimate (13.51) using only the positive $y_i^*$'s with $\hat{\gamma}_i$ substituted for
$\gamma_i$. The resulting estimator of $\beta$ is consistent and asymptotically normal, see Heckman (1976,
1979).

Alternatively, one can use maximum likelihood procedures to estimate the Tobit model. Note
that we have two sets of observations: (i) the positive $y_i^*$'s with $y_i = y_i^*$, for which we can write
the density function $N(x_i'\beta, \sigma^2)$, and (ii) the non-positive $y_i^*$'s for which we assign $y_i = 0$ with
probability

$\Pr[y_i = 0] = \Pr[y_i^* < 0] = \Pr[u_i < -x_i'\beta] = \Phi(-x_i'\beta/\sigma) = 1 - \Phi(x_i'\beta/\sigma)$  (13.53)

The probability over the entire censored region gets assigned to the censoring point. This allows
us to write the following log-likelihood:

$\log\ell = -(1/2)\sum_{i=1}^{n_1} \log(2\pi\sigma^2) - (1/2\sigma^2)\sum_{i=1}^{n_1} (y_i - x_i'\beta)^2 + \sum_{i=n_1+1}^{n} \log[1 - \Phi(x_i'\beta/\sigma)]$  (13.54)

Differentiating with respect to $\beta$ and $\sigma^2$, see Maddala (1983, p. 153), one gets

$\partial\log\ell/\partial\beta = \sum_{i=1}^{n_1} (y_i - x_i'\beta)x_i/\sigma^2 - \sum_{i=n_1+1}^{n} \phi_i x_i/\sigma[1 - \Phi_i]$  (13.55)

$\partial\log\ell/\partial\sigma^2 = \sum_{i=1}^{n_1} (y_i - x_i'\beta)^2/2\sigma^4 - (n_1/2\sigma^2) + \sum_{i=n_1+1}^{n} \phi_i x_i'\beta/[2\sigma^3(1 - \Phi_i)]$  (13.56)

where $\Phi_i$ and $\phi_i$ are evaluated at $z_i = x_i'\beta/\sigma$.

Premultiplying (13.55) by $\beta'/2\sigma^2$ and adding the result to (13.56), one gets

$\hat{\sigma}^2_{MLE} = \sum_{i=1}^{n_1} (y_i - x_i'\beta)y_i/n_1 = Y_1'(Y_1 - X_1\beta)/n_1$  (13.57)

where $Y_1$ denotes the $n_1 \times 1$ vector of non-zero observations on $y_i$, and $X_1$ is the $n_1 \times k$ matrix
of values of $x_i$ for the non-zero $y_i$'s. Also, after multiplying throughout by $\sigma$, (13.55) can be
written as:

$-X_0'\gamma_0 + X_1'(Y_1 - X_1\beta)/\sigma = 0$  (13.58)

where $X_0$ denotes the $n_0 \times k$ matrix of $x_i$'s for which $y_i$ is zero, and $\gamma_0$ is an $n_0 \times 1$ vector of $\gamma_i$'s
$= \phi_i/[1 - \Phi_i]$ evaluated at $z_i = x_i'\beta/\sigma$ for the observations for which $y_i = 0$. Solving (13.58) one
gets

$\hat{\beta}_{MLE} = (X_1'X_1)^{-1}X_1'Y_1 - \sigma(X_1'X_1)^{-1}X_0'\gamma_0$  (13.59)

Note that the first term in (13.59) is the OLS estimator for the first $n_1$ observations for which
$y_i^*$ is positive.
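
A quick way to check a score expression such as (13.55) is to compare it with a numerical derivative of the log-likelihood (13.54). The sketch below does this on a tiny made-up data set. It is an illustration under stated assumptions: the data, the parameter values and the function names `tobit_loglik`, `tobit_score_beta` and `numerical_gradient` are hypothetical and are not the Stata implementation.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(beta, sigma2, X, y):
    """Log-likelihood (13.54): normal density for y_i > 0, censoring probability for y_i = 0."""
    sigma = math.sqrt(sigma2)
    total = 0.0
    for x_i, y_i in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, x_i))
        if y_i > 0:
            total += -0.5 * math.log(2.0 * math.pi * sigma2) - (y_i - xb) ** 2 / (2.0 * sigma2)
        else:
            total += math.log(1.0 - norm_cdf(xb / sigma))
    return total

def tobit_score_beta(beta, sigma2, X, y):
    """Analytic derivative with respect to beta, as in (13.55)."""
    sigma = math.sqrt(sigma2)
    grad = [0.0] * len(beta)
    for x_i, y_i in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, x_i))
        if y_i > 0:
            weight = (y_i - xb) / sigma2
        else:
            z = xb / sigma
            weight = -norm_pdf(z) / (sigma * (1.0 - norm_cdf(z)))
        for k, x in enumerate(x_i):
            grad[k] += weight * x
    return grad

def numerical_gradient(f, beta, step=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for k in range(len(beta)):
        up, down = list(beta), list(beta)
        up[k] += step
        down[k] -= step
        grad.append((f(up) - f(down)) / (2.0 * step))
    return grad

# Hypothetical data: a constant and one regressor; zeros are censored observations.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1], [1.0, -0.4]]
y = [1.3, 0.0, 2.7, 0.0, 0.4]
beta, sigma2 = [0.3, 0.9], 1.5

analytic = tobit_score_beta(beta, sigma2, X, y)
numeric = numerical_gradient(lambda b: tobit_loglik(b, sigma2, X, y), beta)
print("analytic score:", analytic)
print("numerical score:", numeric)
```

Agreement between the two gradients at an arbitrary parameter point is a useful sanity check before handing the score to a Newton-Raphson or scoring routine.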

One can use the Newton-Raphson procedure or the method of scoring to maximize the
log-likelihood; for the second derivatives of the log-likelihood, see Maddala (1983, pp. 154-156).
The maximum likelihood estimates can be computed with the
tobit command in Stata. Note that for the Tobit specification, both $\beta$ and $\sigma^2$ are identified.
This is contrasted to the logit and probit specifications where only the ratio $(\beta/\sigma)$ is identified.
Wooldridge (2009, Chapter 17) recommends obtaining the estimates of $(\beta/\sigma)$ from a probit
and comparing those with the Tobit estimates generated by dividing $\hat{\beta}$ by $\hat{\sigma}$. If these estimates
are different or have different signs, then the Tobit estimation may not be appropriate. Problem
13.17 illustrates the Tobit estimation for the married women labor supply example using the Mroz
(1987) data.

Maddala  warns  that  the  Tobit  specification  is  not  necessarily  the  right  specification  every  time 
we  have  zero  observations.  It  is  applicable  only  in  those  cases  where  the  latent  variable  can, 
in  principle,  take  negative  values  and  the  observed  zero  values  are  a  consequence  of  censoring 
and  non-observability.  In  fact,  one  cannot  have  negative  expenditures  on  a  car,  negative  hours 
of  work  or  negative  wages.  However,  one  can  enter  employment  and  earn  wages  when  one’s 
observed  wage  is  larger  than  the  reservation  wage.  Let  y*  be  the  difference  between  observed 
wage  and  reservation  wage.  Only  if  y*  is  positive  will  wages  be  observed.  Final  warning:  The 
Tobit  specification  is  heavily  reliant  on  the  normality  and  homoskedasticity  assumptions.  Failure 
of  these  assumptions  leads  to  misleading  inference. 


13.12  The  Truncated  Regression  Model 


The truncated regression model excludes or truncates some observations from the sample. For
example, in studying poverty we exclude the rich, say with earnings larger than some upper
limit $y^u$ from our sample. The sample is therefore not random and applying least squares to
the truncated sample leads to biased and inconsistent results, see Figure 13.3. This differs from
censoring. In the latter case, no data is excluded. In fact, we observe the characteristics of all
households even those that do not actually purchase a car. The truncated regression model is
given by

$y_i^* = x_i'\beta + u_i$  i = 1, 2, ..., n  with $u_i \sim$ IIN$(0, \sigma^2)$  (13.60)

where $y_i^*$ is for example earnings of the i-th household and $x_i$ contains determinants of earnings
like education, experience, etc. The sample contains observations on individuals with $y_i^* < y^u$.
The probability that $y_i^*$ will be observed is

$\Pr[y_i^* < y^u] = \Pr[x_i'\beta + u_i < y^u] = \Pr[u_i < y^u - x_i'\beta] = \Phi((y^u - x_i'\beta)/\sigma)$  (13.61)

In addition, using the results of a truncated normal density, see Greene (1993, p. 685)

$E(u_i/y_i^* < y^u) = -\sigma\phi((y^u - x_i'\beta)/\sigma)/\Phi((y^u - x_i'\beta)/\sigma)$  (13.62)

which is not necessarily zero. From (13.60) one can see that $E(y_i^*/y_i^* < y^u) = x_i'\beta + E(u_i/y_i^* < y^u)$.
Therefore, OLS on (13.60) using the observed $y_i^*$ is biased and inconsistent because it
ignores the term in (13.62).
Figure 13.3 Truncated Regression Model

The density of $y_i^*$ is normal but its total area over the observed range is given by (13.61). A proper density function
has to have an area of 1. Therefore, the density of $y_i^*$ conditional on $y_i^* < y^u$ is simply the
unconditional density of $y_i^*$ restricted to values of $y_i^* < y^u$ divided by $\Pr[y_i^* < y^u]$, see the
Appendix to this chapter:

$f(y_i^*) = \phi((y_i^* - x_i'\beta)/\sigma)/[\sigma\Phi((y^u - x_i'\beta)/\sigma)]$  if $y_i^* < y^u$
$\qquad\;\, = 0$  otherwise  (13.63)

The log-likelihood function is therefore

$\log\ell = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i^* - x_i'\beta)^2 - \sum_{i=1}^{n} \log\Phi((y^u - x_i'\beta)/\sigma)$  (13.64)


It  is  the  last  term  which  makes  MLE  differ  from  OLS  on  the  observed  sample.  Hausman  and 
Wise  (1977)  applied  the  truncated  regression  model  to  data  from  the  New  Jersey  negative- 
income-tax  experiment  where  families  with  incomes  higher  than  1.5  times  the  1967  poverty  line 
were  eliminated  from  the  sample. 
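
The truncated density (13.63) and the conditional mean implied by (13.62) can be verified by numerical integration. The following is a minimal sketch with hypothetical values for $x_i'\beta$, $\sigma$ and $y^u$; the helper names `truncated_density` and `midpoint_integral` are illustrative only.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_density(y, xb, sigma, y_upper):
    """Density (13.63): normal density rescaled by Pr[y* < y_upper], zero above the limit."""
    if y >= y_upper:
        return 0.0
    return norm_pdf((y - xb) / sigma) / (sigma * norm_cdf((y_upper - xb) / sigma))

def midpoint_integral(g, lower, upper, steps=20000):
    """Midpoint rule; every evaluation point lies strictly inside (lower, upper)."""
    width = (upper - lower) / steps
    return width * sum(g(lower + (i + 0.5) * width) for i in range(steps))

# Hypothetical values: x_i'beta = 1, sigma = 2, upper truncation point y^u = 2.
xb, sigma, y_upper = 1.0, 2.0, 2.0
lower = xb - 10.0 * sigma      # far enough in the left tail to be negligible

area = midpoint_integral(lambda y: truncated_density(y, xb, sigma, y_upper), lower, y_upper)
mean = midpoint_integral(lambda y: y * truncated_density(y, xb, sigma, y_upper), lower, y_upper)

a = (y_upper - xb) / sigma
mean_from_13_62 = xb - sigma * norm_pdf(a) / norm_cdf(a)

print("area under the truncated density:", area)
print("mean by integration:", mean)
print("x'beta plus the term in (13.62):", mean_from_13_62)
```

The area is one, and the mean falls below $x_i'\beta$ by exactly the term in (13.62), which is the quantity that OLS on the truncated sample leaves out.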


13.13  Sample  Selectivity 

In  labor  economics,  one  observes  the  market  wages  of  individuals  only  if  the  worker  participates 
in  the  labor  force.  This  happens  when  the  worker’s  market  wage  exceeds  his  or  her  reservation 
wage.  In  a  study  of  earnings,  one  does  not  observe  the  reservation  wage  and  for  non-participants 
in  the  labor  force  we  record  a  zero  market  wage.  This  sample  is  censored  because  we  observe 
the  characteristics  of  these  non-labor  participants.  If  we  restrict  our  attention  to  labor  market 
participants  only,  then  the  sample  is  truncated.  This  example  needs  special  attention  because 
the  censoring  is  not  based  directly  on  the  dependent  variable,  as  in  section  13.11.  Rather,  it  is 
based  on  the  difference  between  market  wage  and  reservation  wage.  This  latent  variable  which 
determines  the  sample  selection  is  correlated  with  the  dependent  variable.  Hence,  least  squares 
on this model results in selectivity bias, see Heckman (1976, 1979). A sample generated by this
type of self-selection may not represent the true population distribution no matter how big
the sample size. However, one can correct for self-selection bias if the underlying sample
generating process can be understood and relevant identification conditions are available, see
Lee (2001) for an excellent summary and the references cited there. In order to demonstrate this,
let the earnings equation be given by

$w_i^* = x_{1i}'\beta + u_i$  i = 1, 2, ..., n  (13.65)

and the labor participation (or selection) equation be given by

$y_i^* = x_{2i}'\gamma + v_i$  i = 1, 2, ..., n  (13.66)

where $u_i$ and $v_i$ are bivariate normally distributed with mean zero and variance-covariance

$\mathrm{var}\begin{pmatrix} u_i \\ v_i \end{pmatrix} = \begin{bmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{bmatrix}$  (13.67)

Normalizing the variance of $v_i$ to be 1 is not restrictive, since only the sign of $y_i^*$ is observed.
In fact, we only observe $w_i$ and $y_i$ where

$w_i = w_i^*$  if $y_i^* > 0$
$\quad\;\, = 0$  otherwise  (13.68)

and

$y_i = 1$  if $y_i^* > 0$
$\quad\, = 0$  otherwise

We observe $(y_i = 0, w_i = 0)$ and $(y_i = 1, w_i = w_i^*)$ only. The log-likelihood for this model is

$\sum_{y_i=0} \log \Pr[y_i = 0] + \sum_{y_i=1} \log \Pr[y_i = 1]f(w_i^*/y_i = 1)$  (13.69)

where $f(w_i^*/y_i = 1)$ is the conditional density of $w_i^*$ given that $y_i = 1$. The second term can also
be written as $\sum_{y_i=1} \log \Pr[y_i = 1/w_i^*]f(w_i^*)$ which is another way of factoring the joint density
function. $f(w_i^*)$ is in fact a normal density with conditional mean $x_{1i}'\beta$ and variance $\sigma^2$. Using
properties of the bivariate normal density, one can write

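$y_i^* = x_{2i}'\gamma + \rho\left(\frac{1}{\sigma}(w_i^* - x_{1i}'\beta)\right) + \epsilon_i$  (13.70)

where $\epsilon_i \sim$ IIN$(0, 1 - \rho^2)$, the variance of $v_i$ conditional on $u_i$ under the normalization var$(v_i) = 1$. Therefore,

$\Pr[y_i = 1] = \Pr[y_i^* > 0] = \Phi\left(\frac{x_{2i}'\gamma + \rho((w_i - x_{1i}'\beta)/\sigma)}{\sqrt{1 - \rho^2}}\right)$  (13.71)

where $w_i$ has been substituted for $w_i^*$ since $y_i = 1$. The likelihood function in (13.69) becomes

$\sum_{y_i=0} \log(\Phi(-x_{2i}'\gamma)) + \sum_{y_i=1} \log\left(\frac{1}{\sigma}\phi((w_i - x_{1i}'\beta)/\sigma)\right) + \sum_{y_i=1} \log\Phi\left(\frac{x_{2i}'\gamma + \rho((w_i - x_{1i}'\beta)/\sigma)}{\sqrt{1 - \rho^2}}\right)$  (13.72)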
MLE may be computationally burdensome. Heckman (1976) suggested a two-step procedure
which is based on rewriting (13.65) as

$w_i^* = x_{1i}'\beta + \rho\sigma v_i + \eta_i$  (13.73)

and replacing $w_i^*$ by $w_i$ and $v_i$ by its conditional mean $E[v_i/y_i = 1]$. Using the results on
truncated density, this conditional mean is given by $\phi(x_{2i}'\gamma)/\Phi(x_{2i}'\gamma)$ known also as the inverse
Mills ratio. Hence, (13.73) becomes

$w_i = x_{1i}'\beta + \rho\sigma\,\frac{\phi(x_{2i}'\gamma)}{\Phi(x_{2i}'\gamma)} + \text{residual}$  (13.74)

Heckman's (1976) two-step estimator consists of (i) running a probit on (13.66) in the first step
to get a consistent estimate of $\gamma$, (ii) substituting the estimated inverse Mills ratio in (13.74)
and running OLS. Since $\sigma$ is positive, this second stage regression provides a test for sample
selectivity, i.e., for $\rho = 0$, by checking whether the t-statistic on the estimated inverse Mills ratio
is significant. This statistic is asymptotically distributed as N(0, 1). Rejecting $H_0$ implies there
is a selectivity problem and one should not rely on OLS on (13.65) which ignores the selectivity
bias term in (13.74). Davidson and MacKinnon (1993) suggest performing MLE using (13.72)
rather than relying on the two-step results in (13.74) if the former is not computationally
burdensome. Note that the Tobit model for car purchases given in (13.50) can be thought of as
a special case of the sample selectivity model given by (13.65) and (13.66). In fact, the Tobit
model assumes that the selection equation (the decision to buy a car) and the car expenditure
equation (conditional on the decision to buy) are identical. Therefore, if one thinks that the
specification of the selection equation is different from that of the expenditure equation, then
one should not use the Tobit model. Instead, one should proceed with the two equation
sample selectivity model discussed in this section. It is also important to emphasize that for the
censored, truncated and sample selectivity models, normality and homoskedasticity are crucial
assumptions. Suggested tests for these assumptions are given in Bera, Jarque and Lee (1984),
Lee and Maddala (1985) and Pagan and Vella (1989). Alternative estimation methods that are
more robust to violations of normality and homoskedasticity include symmetrically trimmed
least squares for Tobit models and least absolute deviations estimation for censored regression
models. These were suggested by Powell (1984, 1986).
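
The selectivity bias term in (13.74) can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the values of $\rho$, $\sigma$ and the selection index $x_{2i}'\gamma$, the seed and the variable names are hypothetical. It draws $(u_i, v_i)$ with the covariance matrix in (13.67), keeps the draws with $y_i^* > 0$, and compares the average of the selected $u_i$ with $\rho\sigma\phi(x_{2i}'\gamma)/\Phi(x_{2i}'\gamma)$.

```python
import math
import random

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

random.seed(2024)                 # fixed seed so the sketch is reproducible
rho, sigma = 0.6, 2.0             # hypothetical correlation and std. dev. of u_i
index = 0.3                       # hypothetical value of x_2i'gamma

selected_u = []
for _ in range(200000):
    v = random.gauss(0.0, 1.0)                 # var(v_i) normalized to 1
    e = random.gauss(0.0, 1.0)
    # u has variance sigma^2 and covariance rho*sigma with v, as in (13.67)
    u = sigma * (rho * v + math.sqrt(1.0 - rho ** 2) * e)
    if index + v > 0:                          # selection rule y_i^* > 0
        selected_u.append(u)

sim_bias = sum(selected_u) / len(selected_u)
theory_bias = rho * sigma * norm_pdf(index) / norm_cdf(index)

print("share selected:", len(selected_u) / 200000)
print("simulated mean of u_i among the selected:", sim_bias)
print("rho*sigma*inverse Mills ratio from (13.74):", theory_bias)
```

When $\rho = 0$ the term vanishes, which is why the t-statistic on the estimated inverse Mills ratio serves as a test for sample selectivity.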


Notes 

1.  This  is  based  on  Davidson  and  MacKinnon  (1993,  pp.  523-526). 

2.  A  binary  response  model  attempts  to  explain  a  zero-one  (or  binary)  dependent  variable. 

3. One should not use $nR^2$ as the test statistic because the total sum of squares in this case is not n.

Problems 

1.  The  Linear  Probability  Model. 

(a) For the linear probability model described in (13.1), show that for $E(u_i)$ to equal zero, we
must have $\Pr[y_i = 1] = x_i'\beta$.

(b) Show that $u_i$ is heteroskedastic with var$(u_i) = x_i'\beta(1 - x_i'\beta)$.

2. Consider the general log-likelihood function given in (13.16). Assume that all the regression slopes
are zero except for the constant $\alpha$.

(a) Show that maximizing $\log\ell$ with respect to the constant $\alpha$ yields $F(\hat{\alpha}) = \bar{y}$, where $\bar{y}$ is the
proportion of the sample with $y_i = 1$.

(b) Conclude that the value of the maximized likelihood is $\log\ell_r = n[\bar{y}\log\bar{y} + (1 - \bar{y})\log(1 - \bar{y})]$.

(c) Verify that for the union participation example in section 13.9, $\log\ell_r = -390.918$.

3.  For  the  union  participation  example  considered  in  section  13.9: 

(a)  Replicate  Tables  13.3  and  13.5. 

(b) Using the measures of fit considered in section 13.8, compute $R_1^2$, $R_2^2$, ..., for the logit and
probit models.

(c)  Compute  the  predicted  value  for  the  10th  observation  using  OLS,  logit  and  probit  models. 
Also,  the  corresponding  standard  errors. 

(d)  The  industry  variable  (IND)  was  not  significant  in  all  models.  Drop  this  variable  and  run 
OLS,  logit  and  probit.  How  do  the  results  change?  Compare  with  Tables  13.3  and  13.5. 

(e)  Using  the  model  results  in  part  (d),  test  that  the  slope  coefficients  are  all  zero  for  the  logit, 
probit,  and  linear  probability  models. 

(f)  Test  that  the  coefficients  of  IND,  FEM  and  BLK  in  Table  13.3  are  jointly  insignificant  using 
a  LR  test,  a  Wald  test  and  a  BRMR  using  OLS,  logit  and  probit. 

4.  For  the  data  used  in  the  union  participation  example  in  section  13.9: 

(a)  Run  OLS,  logit  and  probit  using  as  the  dependent  variable  OCC  which  is  one  if  the  individual 
is  in  a  blue-collar  occupation,  and  zero  otherwise.  For  the  independent  variables  use  ED, 
WKS,  EXP,  SOUTH,  SMSA,  IND,  MS,  FEM  and  BLK.  Compare  the  coefficient  estimates 
across  the  three  models.  What  variables  are  significant? 

(b) Using the measures of fit considered in section 13.8, compute $R_1^2$, ..., for the logit and
probit models.

(c) Tabulate the actual versus predicted values for OCC from all three model results, like
Table 13.5 for Union. What is the proportion of correct decisions for OLS, logit and probit?

(d)  Test  that  the  slope  coefficients  are  all  zero  for  the  logit,  probit,  and  linear  probability  models. 

5. Truncated Uniform Density. Let x be a uniformly distributed random variable with density

$f(x) = \frac{1}{2}$  for $-1 < x < 1$

(a) What is the density function of $f(x/x > -1/2)$? Hint: Use the definition of a conditional
density

$f(x/x > -1/2) = f(x)/\Pr[x > -1/2]$  for $-\frac{1}{2} < x < 1$.

(b)  What  is  the  conditional  mean  E(x/x  >  —1/2)?  How  does  it  compare  with  the  unconditional 
mean  of  x?  Note  that  because  we  truncated  the  density  from  below,  the  new  mean  should 
shift  to  the  right. 

(c)  What  is  the  conditional  variance  var(x/x  >  —1/2)?  How  does  it  compare  to  the  unconditional 
var(x)?  (Truncation  reduces  the  variance). 

6. Truncated Normal Density. Let x be N(1, 1). Using the results in the Appendix, show that:

(a) The conditional density $f(x/x > 1) = 2\phi(x - 1)$ for $x > 1$ and $f(x/x < 1) = 2\phi(x - 1)$ for
$x < 1$.

(b) The conditional mean $E(x/x > 1) = 1 + 2\phi(0)$ and $E(x/x < 1) = 1 - 2\phi(0)$. Compare with
the unconditional mean of x.

(c) The conditional variance var$(x/x > 1)$ = var$(x/x < 1) = 1 - 4\phi^2(0)$. Compare with the
unconditional variance of x.

7. Censored Normal Distribution. This is based on Greene (1993, pp. 692-693). Let $y^*$ be $N(\mu, \sigma^2)$
and define $y = y^*$ if $y^* > c$ and $y = c$ if $y^* < c$ for some constant c.

(a) Verify the E(y) expression given in (A.7).

(b) Derive the var(y) expression given in (A.8). Hint: Use the fact that

var(y) = E(conditional variance) + var(conditional mean)

and the formulas given in the Appendix for conditional and unconditional means of a
truncated normal random variable.

(c) For the special case of c = 0, show that (A.7) simplifies to

$E(y) = \Phi(\mu/\sigma)\left[\mu + \frac{\sigma\phi(\mu/\sigma)}{\Phi(\mu/\sigma)}\right]$

and (A.8) simplifies to

$\mathrm{var}(y) = \sigma^2\Phi(\mu/\sigma)\left[1 - \delta(-\mu/\sigma) + \left(-\frac{\mu}{\sigma} - \frac{\phi(\mu/\sigma)}{\Phi(\mu/\sigma)}\right)^2 \Phi(-\mu/\sigma)\right]$

where $\delta(-\mu/\sigma) = \frac{\phi(\mu/\sigma)}{\Phi(\mu/\sigma)}\left[\frac{\phi(\mu/\sigma)}{\Phi(\mu/\sigma)} + \frac{\mu}{\sigma}\right]$. Similar expressions can be derived for censoring
of the upper part rather than the lower part of the distribution.


8. Fixed and Adjustable Rate Mortgages. Dhillon, Shilling and Sirmans (1987) considered the
economic decision of choosing between fixed and adjustable rate mortgages. The data consisted of 78
households borrowing money from a Louisiana mortgage banker. Of these, 46 selected fixed rate
mortgages and 32 selected uncapped adjustable rate mortgages. This data set can be downloaded
from the Springer web site and is labelled DHILLON.ASC. It was obtained from Lott and Ray
(1992). These variables include:

Y    = 0 if adjustable rate and 1 if fixed rate.
BA   = Age of the borrower.
BS   = Years of schooling for the borrower.
NW   = Net worth of the borrower.
FI   = Fixed interest rate.
PTS  = Ratio of points paid on adjustable to fixed rate mortgages.
MAT  = Ratio of maturities on adjustable to fixed rate mortgages.
MOB  = Years at the present address.
MC   = 1 if the borrower is married and 0 otherwise.
FTB  = 1 if the borrower is a first-time home buyer and 0 otherwise.
SE   = 1 if the borrower is self-employed and 0 otherwise.
YLD  = The difference between the 10-year treasury rate less the 1-year treasury rate.
MARG = The margin on the adjustable rate mortgage.
CB   = 1 if there is a co-borrower and 0 otherwise.
STL  = Short-term liabilities.
LA   = Liquid assets.

The probability of choosing a variable rate mortgage is a function of personal borrower
characteristics as well as mortgage characteristics. The efficient market hypothesis states that only cost
variables and not personal borrower characteristics influence the borrower's decision between fixed
and adjustable rate mortgages. Cost variables include FI, MARG, YLD, PTS and MAT. The rest
of the variables are personal characteristics variables. The principal agent theory suggests that
information is asymmetric between lender and borrower. Therefore, one implication of this theory is
that the personal characteristics of the borrower will be significant in the choice of mortgage loan.

(a)  Run  OLS  of  Y  on  all  variables  in  the  data  set.  For  this  linear  probability  model  what  does 
the  F-statistic  for  the  significance  of  all  slope  coefficients  yield?  What  is  the  R2?  How  many 
predictions  are  less  than  zero  or  larger  than  one? 

(b)  Using  only  the  cost  variables  in  the  restricted  regression,  test  that  personal  characteristics 
are  jointly  insignificant.  Hint:  Use  the  Chow-F  statistic.  Do  you  find  support  for  the  efficient 
market  hypothesis? 

(c)  Run  the  above  model  using  the  logit  specification.  Test  the  efficient  market  hypothesis.  Does 
your  conclusion  change  from  that  in  part  (b)?  Hint:  Use  the  likelihood  ratio  test  or  the 
BRMR. 

(d)  Do  part  (c)  using  the  probit  specification. 

9. Sampling Distribution of OLS Under a Logit Model. This is based on Baltagi (2000).

Consider a simple logit regression model

$y_t = \Lambda(\beta x_t) + u_t$

for t = 1, 2, where $\Lambda(z) = e^z/(1 + e^z)$ for $-\infty < z < \infty$. Let $\beta = 1$, $x_1 = 1$, $x_2 = 2$ and assume
that the $u_t$'s are independent with mean zero.

(a) Derive the sampling distribution of the least squares estimator of $\beta$, i.e., assuming a linear
probability model when the true model is a logit model.

(b) Derive the sampling distribution of the least squares residuals and verify that the estimated
variance of $\hat{\beta}_{OLS}$ is biased.

10. Sample Selection and Non-response. This is based on Manski (1995), see the Appendix to this
chapter. Suppose we are interested in estimating the probability that an individual who is
homeless at a given date has a home six months later. Let y = 1 if the individual has a home six months
later and y = 0 if the individual remains homeless. Let x be the sex of the individual and let z = 1
if the individual was located and interviewed and zero otherwise. 100 men and 31 women were
initially interviewed. Six months later, only 64 men and 14 women were located and interviewed.
Of the 64 men, 21 exited homelessness. Of the 14 women only 3 exited homelessness.

(a) Compute Pr[y = 1/Male, z = 1], Pr[z = 1/Male] and the bound on Pr[y = 1/Male].

(b) Compute Pr[y = 1/Female, z = 1], Pr[z = 1/Female] and the bound on Pr[y = 1/Female].

(c)  Show  that  the  width  of  the  bound  is  equal  to  the  probability  of  attrition.  Which  bound  is 
tighter?  Why? 

11. Does the Link Matter? This is based on Santos Silva (1999). Consider a binary random variable
$Y_i$ such that

$P(Y_i = 1|x) = F(\beta_0 + \beta_1 x_i)$, $i = 1, \ldots, n$,

where the link F(·) is a continuous distribution function.




(a) Write down the log-likelihood function and the first-order conditions of maximization with
respect to $\beta_0$ and $\beta_1$.

(b) Consider the case where $x_i$ only assumes two different values, without loss of generality, let
it be 0 and 1. Show that $\hat{F}(1) = \sum_{x_i=1} y_i/n_1$, where $n_1$ is the number of observations for
which $x_i = 1$. Also, show that $\hat{F}(0) = \sum_{x_i=0} y_i/(n - n_1)$.

(c) What are the maximum likelihood estimates of $\beta_0$ and $\beta_1$?

(d) Show that the value of the log-likelihood function evaluated at the maximum likelihood
estimates of $\beta_0$ and $\beta_1$ is the same, independent of the form of the link function.

12.  Beer  Taxes  and  Motor  Vehicle  Fatality.  Ruhm  (1996)  considered  the  effect  of  beer  taxes  and  a 
variety  of  alcohol-control  policies  on  motor  vehicle  fatality  rates,  see  section  13.4.  The  data  is 
for  48  states  (excluding  Alaska,  Hawaii  and  the  District  of  Columbia)  over  the  period  1982-1988. 
This  data  set  can  be  downloaded  from  the  Stock  and  Watson  (2003)  web  site  at  www.aw.com/ 
stock_watson.  Using  this  data  set  replicate  the  results  in  Table  13.1. 

13.  Problem  Drinking  and  Employment.  Mullahy  and  Sindelar  (1996)  considered  the  effect  of  problem 
drinking on employment and unemployment. The data set is based on the 1988 Alcohol Supplement
of the National Health Interview Survey. This can be downloaded from the Journal of
Applied Econometrics web site at http://qed.econ.queensu.ca/jae/2002-v17.4/terza/.

(a) Replicate the probit results in Table 13.6 and run also the logit and OLS regressions with robust
White standard errors. The OLS results should match those given in Table 5 of Mullahy
and Sindelar (1996).

(b)  Compute  marginal  effects  as  reported  in  Table  13.7  and  average  marginal  effects  as  reported 
in  Table  13.8.  Compute  the  classifications  of  actual  vs  predicted  as  reported  in  Table  13.9. 
Repeat  these  calculations  for  OLS  and  logit. 

(c) Mullahy and Sindelar (1996) performed similar regressions for females and for the dependent
variable taking the value of 1 if the individual is unemployed and zero otherwise. Replicate
the OLS results in Tables 5 and 6 of Mullahy and Sindelar (1996) and perform the corresponding
logit and probit regressions. Repeat part (b) for the female data set. What is your
conclusion on the relationship between problem drinking and unemployment?

14. Fractional Response. Papke and Wooldridge (1996) studied the effect of match rates on participation
rates in 401(k) pension plans. The data are from the 1987 IRS Form 5500 reports of
pension plans with more than 100 participants. This data set can be downloaded from the Journal
of Applied Econometrics web site at http://qed.econ.queensu.ca/jae/1996-v11.6/papke.

(a)  Replicate  Tables  I  and  II  of  Papke  and  Wooldridge  (1996). 

(b)  Run  the  specification  tests  (RESET)  described  in  Papke  and  Wooldridge  (1996). 

(c)  Compare  OLS  and  logit  QMLE  using  R2,  specification  tests,  and  predictions  for  various 
values  of  MRATE  as  done  in  Figure  1  of  Papke  and  Wooldridge  (1996,  p.  630). 

15.  Fertility  and  Female  Labor  Supply.  Carrasco  (2001)  estimated  a  probit  equation  for  fertility  using 
PSID  data  over  the  period  1986-1989,  see  Table  13.10.  The  sample  consists  of  1,442  married  or 
cohabiting  women  between  the  ages  of  18  and  55  in  1986.  The  data  set  can  be  obtained  from  the 
Journal  of  Business  &  Economic  Statistics  archive  data  web  site. 

(a)  Replicate  Table  4,  columns  1  and  2,  of  Carrasco  (2001,  p.  391).  Show  that  having  children  of 
the  same  sex  has  a  significant  and  positive  effect  on  the  probability  of  having  an  additional 
child. 

(b) Compute the predicted probabilities from these regressions and the percentage of correct
decisions. Also compute the marginal effects.

(c) Replicate Table 5, columns 1 and 4, of Carrasco (2001, p. 392) which run a female labor force
participation  equation  using  OLS  and  probit.  Compute  the  predicted  probabilities  from  the 
probit  regression  and  the  percentage  of  correct  decisions.  Also  compute  the  marginal  effects. 

(d) Replicate Table 5, column 5, of Carrasco (2001, p. 392) which runs 2SLS on the female labor
force participation equation using as instruments the same sex variables and their interactions
with ags261. Compute the over-identification test for this 2SLS.

(e) Replicate Table 7, column 4, of Carrasco (2001, p. 393) which runs fixed effects on the female
labor force participation equation with robust standard errors. Run also fixed effects
2SLS using as instruments the same sex variables and their interactions with ags261.

16. Multinomial Logit. Terza (2002) ran a multinomial logit model on the Mullahy and Sindelar (1996)
data set for problem drinking described in problem 13, by explicitly accounting for the multinomial
classification of the dependent variable. In particular, y = 1 when the individual is out of the labor
force, y = 2 when this individual is unemployed, and y = 3 when this individual is employed.

(a) Replicate the multinomial logit estimates for the male data reported in Table II of Terza
(2002, p. 399) columns 3, 4, 9 and 10.

(b) Obtain the multinomial logit estimates for the female data. How does your conclusion on the
relationship between problem drinking and employment/unemployment change from that in
problem 13?

17. Tobit Estimation of Married Women Labor Supply. Wooldridge (2009, p. 593) estimated a Tobit
equation for the Mroz (1987) data considered in problem 11.31. Using the PSID for 1975, Mroz's
sample consists of 753 married white women between the ages of 30 and 60 in 1975, with 428 working
at some time during the year. The wife's annual hours of work (hours) is regressed on nonwife
income (nwifeinc), the wife's age (age), her years of schooling (educ), the number of children less
than six years old in the household (kidslt6), and the number of children between the ages of five
and nineteen (kidsge6). The data set was obtained from Wooldridge's (2009) data web site.

(a) Give a detailed summary of hours of work and determine the extent of skewness and kurtosis.

(b) Run OLS and Tobit estimation as described above and replicate Table 17.2 of Wooldridge
(2009, p. 593).

(c) Using the variable in the labor force (inlf), run OLS, logit and probit using the same explanatory
variables given above and replicate Table 17.1 of Wooldridge (2009, p. 585).
Give the predicted classification and the marginal effects at the mean as well as the average
marginal effects for these three specifications.

(d) Compare the estimates of $(\beta/\sigma)$ from the probit in part (c) with the Tobit estimates given
in part (b).

18. Heckit Estimation of Married Women's Earnings. Wooldridge (2009, p. 611) estimated a log wage
equation for the Mroz (1987) data considered in problem 13.7. The wife's log wage (lwage) is
regressed on her years of schooling (educ), her experience (exper) and its square (expersq). The
probit equation to correct for sample selection includes the regressors plus the following additional
variables: the number of children less than six years old in the household (kidslt6), the number of
children between the ages of five and nineteen (kidsge6), nonwife income (nwifeinc) and the wife's
age (age).

(a)  Run  OLS  and  Heckit  estimation  as  described  above  and  replicate  Table  17.5  of  Wooldridge 
(2009,  p.  611). 

(b)  Test  that  the  inverse  Mills  ratio  is  not  significant.  What  do  you  conclude? 

(c) Run the MLE of this Heckman (1976) sample selection model.


Appendix


1.  Truncated  Normal  Distribution 

Let x be $N(\mu, \sigma^2)$, then for a constant c, the truncated density is given by
where $\phi(z)$ denotes the p.d.f. and $\Phi(z)$ denotes the c.d.f. of a N(0, 1) random variable. If the
truncation is from above
In other words, the truncated mean shifts to the right (left) if truncation is from below (above).
The conditional variances are given by $\sigma^2(1 - \delta(c^*))$ with $0 < \delta(c^*) < 1$ for all values of $c^*$.
In other words, the truncated variance is always less than the unconditional or untruncated
variance. For more details, see Maddala (1983, p. 365) or Greene (1993, p. 685).
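
The displayed equations of this section did not survive in this copy. As a hedged reminder only, the standard truncated-normal results that the surrounding sentences describe can be written as below. The notation $c^* = (c - \mu)/\sigma$ follows the text; the symbol $\lambda(c^*)$ and the layout are assumptions introduced here, and the lines are deliberately left without the book's equation numbers.

```latex
\begin{align*}
% truncation from below (x > c)
f(x \mid x > c) &= \frac{\phi\big((x-\mu)/\sigma\big)}{\sigma\,[1 - \Phi(c^*)]}, \qquad c < x < \infty,
  \qquad c^* = (c - \mu)/\sigma \\
% truncation from above (x < c)
f(x \mid x < c) &= \frac{\phi\big((x-\mu)/\sigma\big)}{\sigma\,\Phi(c^*)}, \qquad -\infty < x < c \\
% conditional means: shifted right for truncation from below, left for truncation from above
E(x \mid x > c) &= \mu + \sigma\,\frac{\phi(c^*)}{1 - \Phi(c^*)}, \qquad
E(x \mid x < c) = \mu - \sigma\,\frac{\phi(c^*)}{\Phi(c^*)} \\
% conditional variance sigma^2 (1 - delta), with 0 < delta < 1
\delta(c^*) &= \lambda(c^*)\,[\lambda(c^*) - c^*], \qquad
\lambda(c^*) =
\begin{cases}
\phi(c^*)/[1 - \Phi(c^*)] & \text{if } x > c \\
-\phi(c^*)/\Phi(c^*) & \text{if } x < c
\end{cases}
\end{align*}
```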


2.  The  Censored  Normal  Distribution 


Let $y^*$ be $N(\mu, \sigma^2)$, then for a constant c, define $y = y^*$ if $y^* > c$ and $y = c$ if $y^* \le c$. Unlike the
truncated normal density, the censored density assigns the entire probability of the censored
region to the censoring point, i.e., y = c. So that $\Pr[y = c] = \Pr[y^* \le c] = \Phi((c - \mu)/\sigma) = \Phi(c^*)$
where $c^* = (c - \mu)/\sigma$. For the uncensored region the probability of $y^*$ remains the same and
can be obtained from the normal density.

It is easy to show, see Greene (1993, p. 692) that
where $E(y^*/y^* > c)$ is obtained from the mean of a truncated normal density, see (A.3).
Similarly,  one  can  show,  see  problem  7  or  Greene  (1993,  p.  693)  that 


$$\mathrm{var}(y) = \sigma^2 [1 - \Phi(c^*)] \left\{ 1 - \delta(c^*) + \left( c^* - \frac{\phi(c^*)}{1 - \Phi(c^*)} \right)^2 \Phi(c^*) \right\}$$

where $\delta(c^*)$ was defined in (A.5).
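
The two moments of the censored normal can be checked numerically. The sketch below is illustrative only: it assumes the standard mean formula E(y) = Φ(c*)c + [1 − Φ(c*)]E(y*/y* > c), obtained by weighting the censoring point and the truncated mean by their probabilities, together with the variance expression above, and compares both with a simulation. The function names and the simulation settings are hypothetical, not taken from the text.

```python
import math
import random


def norm_pdf(z):
    """Standard normal p.d.f. phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal c.d.f. Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def censored_normal_moments(mu, sigma, c):
    """Mean and variance of y = max(y*, c) with y* ~ N(mu, sigma^2)."""
    c_star = (c - mu) / sigma
    Phi = norm_cdf(c_star)
    lam = norm_pdf(c_star) / (1.0 - Phi)        # inverse Mills ratio for y* > c
    delta = lam * (lam - c_star)                # delta(c*), between 0 and 1
    truncated_mean = mu + sigma * lam           # E(y* / y* > c)
    mean = Phi * c + (1.0 - Phi) * truncated_mean
    var = sigma ** 2 * (1.0 - Phi) * (1.0 - delta + (c_star - lam) ** 2 * Phi)
    return mean, var


def simulated_moments(mu, sigma, c, n=100_000, seed=0):
    """Monte Carlo mean and variance of the censored variable (assumed settings)."""
    rng = random.Random(seed)
    draws = [max(rng.gauss(mu, sigma), c) for _ in range(n)]
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / n
    return mean, var


print("formula   :", censored_normal_moments(0.0, 1.0, 0.0))
print("simulation:", simulated_moments(0.0, 1.0, 0.0))
```

With μ = 0, σ = 1 and c = 0 the formulas reduce to E(y) = φ(0) and var(y) = 0.5 − φ(0)², which gives a quick hand check of the implementation.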


3.  Sample  Selection  and  Non-response 

Non-response  is  a  big  problem  plaguing  survey  data.  Some  individuals  refuse  to  respond  and 
some  do  not  answer  all  the  questions,  especially  on  relevant  economic  variables  like  income. 
Suppose  we  interviewed  randomly  150  individuals  upon  their  graduation  from  high  school. 
Among  these,  50  were  female  and  100  were  male.  A  year  later,  we  try  to  re-interview  these 
individuals  to  find  out  whether  they  are  employed  or  not.  Only  70  out  of  100  males  and  40 
out  of  50  females  were  located  and  interviewed  a  year  later.  Out  of  those  re-interviewed,  60 
males  and  20  females  were  found  to  be  employed.  Let  y  =  1  if  the  individual  is  employed  and 
zero  if  not.  Let  x  be  the  sex  of  this  individual  and  let  z  =  1  if  this  individual  is  located  and 
interviewed  a  year  later  and  zero  otherwise. 

Conditioning  on  sex  of  the  respondent  one  can  compute  the  probability  of  being  employed  a 
year  after  high  school  graduation  as  follows: 

Pr[y = 1/x] = Pr[y = 1/x, z = 1] Pr[z = 1/x] + Pr[y = 1/x, z = 0] Pr[z = 0/x]

In this case, Pr[y = 1/Male, z = 1] = 60/70, Pr[z = 1/Male] = 70/100 and Pr[z = 0/Male] =
30/100. But the sampling process is uninformative about the non-respondents or the censored
observations, i.e., Pr[y = 1/Male, z = 0]. Therefore, in the absence of other information

Pr[y = 1/Male] = (0.6) + (0.3) Pr[y = 1/Male, z = 0]

where 0.6 = (60/70)(70/100). Manski (1995) argues that one can estimate bounds on this
probability. In fact, replacing 0 ≤ Pr[y = 1/Male, z = 0] ≤ 1 by its bounds, yields

0.6 ≤ Pr[y = 1/Male] ≤ 0.9

with the width of the bound equal to the probability of non-response conditioning on Males,
i.e., Pr[z = 0/Male] = 0.3. Similarly, 0.4 ≤ Pr[y = 1/Female] ≤ 0.6 with the width of the bound
equal to the probability of non-response conditioning on Females, i.e., Pr[z = 0/Female] =
10/50 = 0.2. Manski (1995) argues that these bounds are informative and should be the starting
point of empirical analysis. Researchers assuming that non-response is ignorable or exogenous
are imposing the following restriction

Pr[y = 1/Male, z = 1] = Pr[y = 1/Male, z = 0] = Pr[y = 1/Male] = 60/70
Pr[y = 1/Female, z = 1] = Pr[y = 1/Female, z = 0] = Pr[y = 1/Female] = 20/40

To the extent that these probabilities differ, the ignorable non-response assumption is cast
into doubt.
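
The bound calculation above is simple enough to sketch in a few lines. This is a minimal illustration under the worst-case reasoning described in the text; the function name and argument names are hypothetical and chosen here only for readability.

```python
def manski_bounds(n_initial, n_located, n_success):
    """Worst-case bounds on Pr[y = 1 / x] when some units are not re-interviewed.

    n_initial: units first interviewed in group x
    n_located: units located and re-interviewed (z = 1)
    n_success: re-interviewed units with y = 1
    """
    if n_initial <= 0 or not (0 <= n_success <= n_located <= n_initial):
        raise ValueError("need 0 <= n_success <= n_located <= n_initial and n_initial > 0")
    p_nonresponse = (n_initial - n_located) / n_initial   # Pr[z = 0 / x]
    lower = n_success / n_initial       # every non-respondent has y = 0
    upper = lower + p_nonresponse       # every non-respondent has y = 1
    return lower, upper


# Appendix example: 100 males (70 located, 60 employed), 50 females (40 located, 20 employed)
for label, counts in {"Male": (100, 70, 60), "Female": (50, 40, 20)}.items():
    low, high = manski_bounds(*counts)
    print(f"{label}: {low:.2f} <= Pr[y = 1] <= {high:.2f}, width = {high - low:.2f}")
```

The lower bound is computed as n_success / n_initial because Pr[y = 1/x, z = 1] Pr[z = 1/x] collapses to that ratio, and the width of the interval is exactly the non-response probability.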


CHAPTER  14 

Time-Series  Analysis 

14.1  Introduction 

There has been an enormous amount of research in time-series econometrics, and many economics
departments have required a time-series econometrics course in their graduate sequence.
Obviously,  one  chapter  on  this  topic  will  not  do  it  justice.  Therefore,  this  chapter  will  focus  on 
some  of  the  basic  concepts  needed  for  such  a  course.  Section  14.2  defines  what  is  meant  by  a 
stationary  time-series,  while  sections  14.3  and  14.4  briefly  review  the  Box-Jenkins  and  Vector 
Autoregression  (VAR)  methods  for  time-series  analysis.  Section  14.5  considers  a  random  walk 
model  and  various  tests  for  the  existence  of  a  unit  root.  Section  14.6  studies  spurious  regressions 
and  trend  stationary  versus  difference  stationary  models.  Section  14.7  gives  a  simple  explanation 
of  the  concept  of  cointegration  and  illustrates  it  with  an  economic  example.  Finally,  section  14.8 
looks  at  Autoregressive  Conditionally  Heteroskedastic  (ARCH)  time-series. 


14.2  Stationarity 

Figure  14.1  plots  the  consumption  and  personal  disposable  income  data  considered  in  Chapter 
5.  This  was  done  using  EViews.  This  is  annual  data  from  1959  to  2007  expressed  in  real  terms. 
Both  series  seem  to  be  trending  upwards  over  time.  This  may  be  an  indication  that  these 
time-series  are  non-stationary.  Having  a  time-series  xt  that  is  trending  upwards  over  time  may 
invalidate all the standard asymptotic theory we have been relying upon in this book. In fact,
$\sum_{t=1}^{T} x_t^2/T$ may not tend to a finite limit as $T \to \infty$ and using regressors such as $x_t$ means that
$X'X/T$ does not tend in probability to a finite positive definite matrix, see problem 6.


Figure  14.1  U.S.  Consumption  and  Income,  1959-2007 

B.H. Baltagi, Econometrics, Springer Texts in Business and Economics, DOI 10.1007/978-3-642-20059-5_14,
© Springer-Verlag Berlin Heidelberg 2011




Non-standard  asymptotic  theory  will  have  to  be  applied  which  is  beyond  the  scope  of  this  book, 
see  problem  8. 

Definition:  A  time-series  process  xt  is  said  to  be  covariance  stationary  (or  weakly  stationary) 
if  its  mean  and  variance  are  constant  and  independent  of  time  and  the  covariances  given  by 
$\mathrm{cov}(x_t, x_{t-s}) = \gamma_s$ depend only upon the distance between the two time periods, but not the
time periods per se.

In order to check the time-series for weak stationarity one can compute its autocorrelation
function. This is given by $\rho_s = \mathrm{correlation}(x_t, x_{t-s}) = \gamma_s/\gamma_0$. These are correlation coefficients
taking values between -1 and +1.

The sample counterparts of the variance and covariances are given by

$$\hat{\gamma}_0 = \sum_{t=1}^{T} (x_t - \bar{x})^2/T$$
$$\hat{\gamma}_s = \sum_{t=1}^{T-s} (x_t - \bar{x})(x_{t+s} - \bar{x})/T$$

and the sample autocorrelation function is given by $\hat{\rho}_s = \hat{\gamma}_s/\hat{\gamma}_0$. Figure 14.2 plots $\hat{\rho}_s$ against
s for the consumption series. This is called the sample correlogram. For a stationary process,
$\rho_s$ declines sharply as the number of lags s increases. This is not necessarily the case for a
nonstationary  series.  In  the  next  section,  we  briefly  review  a  popular  method  for  the  analysis 
of  time-series  known  as  the  Box  and  Jenkins  (1970)  technique.  This  method  utilizes  the  sample 
autocorrelation  function  to  establish  whether  a  series  is  stationary  or  not. 
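
The sample autocorrelation function defined above can be computed directly from its definition. The following is a minimal sketch, assuming the divisor T for every lag as in the formulas of this section; the function name is hypothetical, and statistical packages may differ in small details such as the divisor.

```python
def sample_acf(x, max_lag):
    """Return [rho_hat_0, ..., rho_hat_max_lag] for the series x.

    gamma_hat_s = sum_{t=1}^{T-s} (x_t - xbar)(x_{t+s} - xbar) / T
    rho_hat_s   = gamma_hat_s / gamma_hat_0
    """
    T = len(x)
    if T < 2 or not 0 <= max_lag < T:
        raise ValueError("need at least two observations and 0 <= max_lag < T")
    xbar = sum(x) / T
    dev = [value - xbar for value in x]
    gamma0 = sum(d * d for d in dev) / T
    if gamma0 == 0.0:
        raise ValueError("series is constant, autocorrelations are undefined")
    acf = []
    for s in range(max_lag + 1):
        gamma_s = sum(dev[t] * dev[t + s] for t in range(T - s)) / T
        acf.append(gamma_s / gamma0)
    return acf


# A short trending series: the autocorrelations fall off only gradually with the lag
print(sample_acf([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3))
```

Plotting the returned values against the lag s gives the sample correlogram discussed in the text.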


Sample: 1959 2007
Included observations: 49

Lag       AC      PAC    Q-Stat    Prob
  1    0.935    0.935    45.500   0.000
  2    0.868   -0.050    85.527   0.000
  3    0.800   -0.042    120.26   0.000
  4    0.733   -0.029    150.08   0.000
  5    0.668   -0.024    175.39   0.000
  6    0.604   -0.030    196.57   0.000
  7    0.541   -0.029    214.00   0.000
  8    0.480   -0.033    228.02   0.000
  9    0.421   -0.015    239.12   0.000
 10    0.369    0.004    247.83   0.000
 11    0.320   -0.009    254.57   0.000
 12    0.272   -0.033    259.57   0.000
 13    0.226   -0.027    263.10   0.000
 14    0.181   -0.021    265.45   0.000
 15    0.140   -0.013    266.88   0.000
 16    0.097   -0.052    267.60   0.000
 17    0.055   -0.036    267.83   0.000
 18    0.011   -0.052    267.84   0.000
 19   -0.032   -0.034    267.93   0.000
 20   -0.073   -0.026    268.39   0.000

Figure 14.2 Correlogram of Consumption


14.3 The Box and Jenkins Method

This  method  fits  Autoregressive  Integrated  Moving  Average  (ARIMA)  type  models.  We  have 
already  considered  simple  AR  and  MA  type  models  in  Chapters  5  and  6.  The  Box-Jenkins 
methodology  differences  the  series  and  looks  at  the  sample  correlogram  of  the  differenced  series 
to  see  whether  stationarity  is  achieved.  As  will  be  clear  shortly,  if  we  have  to  difference  the 
series  once,  twice  or  three  times  to  make  it  stationary,  this  series  is  integrated  of  order  1,  2 
or  3,  respectively.  Next,  the  Box-Jenkins  method  looks  at  the  autocorrelation  function  and 
the  partial  autocorrelation  function  (synonymous  with  partial  correlation  coefficients)  of  the 
resulting  stationary  series  to  identify  the  order  of  the  AR  and  MA  process  that  is  needed.  The 
partial  correlation  between  yt  and  yt-s  is  the  correlation  between  those  two  variables  holding 
constant  the  effect  of  all  intervening  lags,  see  Box  and  Jenkins  (1970)  for  details.  Figure  14.3 
plots an AR(1) process of size T = 250 generated as $y_t = 0.7y_{t-1} + \epsilon_t$ with $\epsilon_t \sim$ IIN(0, 4). Figure
14.4 shows that the correlogram of this AR(1) process declines geometrically as s increases.
Similarly, Figure 14.5 plots an MA(1) process of size T = 250 generated as $y_t = \epsilon_t + 0.4\epsilon_{t-1}$
with $\epsilon_t \sim$ IIN(0, 4). Figure 14.6 shows that the correlogram of this MA(1) process is zero after
the  first  lag,  see  also  problems  1  and  2  for  further  analysis.  Identifying  the  right  ARIMA  model 
is  not  an  exact  science,  but  potential  candidates  emerge.  These  models  are  estimated  using 
maximum  likelihood  techniques.  Next,  these  models  are  subjected  to  some  diagnostic  checks. 
One  commonly  used  check  is  to  see  whether  the  residuals  are  White  noise.  If  they  fail  this  test, 
these  models  are  dropped  from  the  list  of  viable  candidates. 
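
The identification step described above can be illustrated by simulating the two processes and comparing their sample correlograms with the theoretical patterns: $\rho_s = 0.7^s$ for the AR(1), and $\rho_1 = 0.4/(1 + 0.4^2)$ with $\rho_s = 0$ for $s > 1$ for the MA(1). This is a sketch under assumed settings (seed, burn-in length, helper names are all hypothetical), not a reproduction of the EViews output in the figures.

```python
import random


def sample_acf(x, max_lag):
    """Sample autocorrelations rho_hat_0..rho_hat_max_lag, using divisor T at every lag."""
    T = len(x)
    xbar = sum(x) / T
    dev = [value - xbar for value in x]
    gamma0 = sum(d * d for d in dev) / T
    return [sum(dev[t] * dev[t + s] for t in range(T - s)) / T / gamma0
            for s in range(max_lag + 1)]


def simulate_ar1(rho, sigma, T, seed=0, burn_in=200):
    """y_t = rho * y_{t-1} + e_t, e_t ~ N(0, sigma^2); the burn-in draws are discarded."""
    rng = random.Random(seed)
    y, path = 0.0, []
    for t in range(T + burn_in):
        y = rho * y + rng.gauss(0.0, sigma)
        if t >= burn_in:
            path.append(y)
    return path


def simulate_ma1(theta, sigma, T, seed=0):
    """y_t = e_t + theta * e_{t-1}, e_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    previous = rng.gauss(0.0, sigma)
    path = []
    for _ in range(T):
        current = rng.gauss(0.0, sigma)
        path.append(current + theta * previous)
        previous = current
    return path


# IIN(0, 4) means variance 4, i.e. standard deviation 2
ar_acf = sample_acf(simulate_ar1(0.7, 2.0, 250), 5)
ma_acf = sample_acf(simulate_ma1(0.4, 2.0, 250), 5)
for s in range(1, 6):
    print(f"lag {s}: AR(1) {ar_acf[s]: .3f} (theory {0.7 ** s:.3f})   MA(1) {ma_acf[s]: .3f}")
```

With T = 250 the sample values scatter around the theoretical ones, which is why the text stresses that identification is not an exact science.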


Figure 14.3 AR(1) Process, $\rho = 0.7$


Sample: 1 250
Included observations: 250

Lag       AC      PAC    Q-Stat    Prob
  1    0.725    0.725    132.99   0.000
  2    0.503   -0.048    197.27   0.000
  3    0.330   -0.037    225.05   0.000
  4    0.206   -0.016    235.92   0.000
  5    0.115   -0.022    239.33   0.000
  6    0.036   -0.050    239.67   0.000
  7   -0.007    0.004    239.68   0.000
  8   -0.003    0.050    239.68   0.000
  9   -0.017   -0.041    239.75   0.000
 10   -0.060   -0.083    240.71   0.000
 11   -0.110   -0.063    243.91   0.000
 12   -0.040    0.191    244.32   0.000

Figure 14.4 Correlogram of AR(1)


Sample: 1 250
Included observations: 250

Lag       AC      PAC    Q-Stat    Prob
  1    0.399    0.399    40.240   0.000
  2    0.033   -0.150    40.520   0.000
  3    0.010    0.066    40.545   0.000
  4   -0.002   -0.033    40.547   0.000
  5    0.090    0.127    42.624   0.000
  6    0.055   -0.045    43.417   0.000
  7    0.028    0.042    43.625   0.000
  8   -0.031   -0.075    43.881   0.000
  9   -0.034    0.023    44.190   0.000
 10   -0.027   -0.045    44.374   0.000
 11   -0.013    0.020    44.421   0.000
 12    0.082    0.086    46.190   0.000

Figure 14.6 Correlogram of MA(1)


If the time-series is White noise, i.e., purely random with constant mean and variance and zero
autocorrelation, then $\rho_s = 0$ for s > 0. In fact, for a White noise series, if $T \to \infty$, $\sqrt{T}\hat{\rho}_s$ will be
asymptotically distributed N(0, 1). A joint test for $H_0; \rho_s = 0$ for s = 1, 2, ..., m lags, is given
by the Box and Pierce (1970) statistic

$$Q = T \sum_{s=1}^{m} \hat{\rho}_s^2 \qquad (14.1)$$

This is asymptotically distributed under the null as $\chi^2_m$. A refinement of the Box-Pierce
Q-statistic that performs better, i.e., has more power in small samples, is the Ljung and Box
(1978) $Q_{LB}$ statistic

$$Q_{LB} = T(T + 2) \sum_{j=1}^{m} \hat{\rho}_j^2/(T - j) \qquad (14.2)$$

This is also distributed asymptotically as $\chi^2_m$ under the null hypothesis. Maddala (1992, p. 540)
warns about the inappropriateness of the Q and $Q_{LB}$ statistics for autoregressive models. The
arguments against their use are the same as those for not using the Durbin-Watson statistic in
autoregressive  models.  Maddala  (1992)  suggests  the  use  of  LM  statistics  of  the  type  proposed 
by  Godfrey  (1979)  to  evaluate  the  adequacies  of  the  ARMA  model  proposed. 
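
Both portmanteau statistics are straightforward to compute from a list of sample autocorrelations. The sketch below implements (14.1) and (14.2) as written; the function names are hypothetical, and the p-value (which needs the $\chi^2_m$ distribution function) is intentionally left out to keep the example within the standard library.

```python
import math


def box_pierce(rho_hat, T):
    """Q = T * sum_{s=1}^{m} rho_hat_s^2, where rho_hat = [rho_hat_1, ..., rho_hat_m]."""
    return T * sum(r * r for r in rho_hat)


def ljung_box(rho_hat, T):
    """Q_LB = T (T + 2) * sum_{j=1}^{m} rho_hat_j^2 / (T - j)."""
    return T * (T + 2) * sum(r * r / (T - j) for j, r in enumerate(rho_hat, start=1))


def white_noise_band(T, z=1.96):
    """Half-width of the approximate 95% band for rho_hat_s under White noise."""
    return z / math.sqrt(T)


# First two sample autocorrelations of the consumption series (Figure 14.2), T = 49
rho_hat = [0.935, 0.868]
print("Box-Pierce Q :", round(box_pierce(rho_hat, 49), 3))
print("Ljung-Box Q  :", round(ljung_box(rho_hat, 49), 3))
print("95% band     : +/-", round(white_noise_band(49), 2))
```

Applied to the rounded autocorrelations in Figure 14.2, the Ljung-Box version comes out close to the Q-Stat column reported there for lags 1 and 2, which suggests that column is the $Q_{LB}$ statistic; small differences reflect rounding of the inputs.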

For the consumption series, T = 49 and the 95% confidence interval for $\hat{\rho}_s$ is $0 \pm 1.96(1/\sqrt{49})$
which is ±0.28. Figure 14.2 plots this 95% confidence interval as two solid lines around zero. It
is clear that the sample correlogram declines slowly as the number of lags s increases. Moreover,
the $Q_{LB}$ statistics which are reported for lags 1, 2, up to 13 are all statistically significant. These
were computed using EViews. Based on the sample correlogram and the Ljung-Box statistic, the
consumption series is not purely random white noise. Figure 14.7 plots the sample correlogram
for $\Delta C_t = C_t - C_{t-1}$. Note that this sample correlogram dies out abruptly after the first lag.
Also, the $Q_{LB}$ statistics are not significant after the first lag. This indicates stationarity of the
first-differenced consumption series. Problem 3 asks the reader to plot the sample correlogram
for personal disposable income and its first difference, and to compute the Ljung-Box $Q_{LB}$
statistic to test for purely White noise based on 13 lags.

A  difficult  question  when  modeling  economic  behavior  is  to  decide  on  what  lags  should  be  in 
the  ARIMA  model,  or  the  dynamic  regression  model.  Granger  et  al.  (1995)  argue  that  there  are 
disadvantages  in  using  hypothesis  testing  to  help  make  model  specification  decisions  based  on 
the  data.  They  recommend  instead  the  use  of  model  selection  criteria  to  make  those  decisions. 

Sample: 1959 2007
Included observations: 48

Lag       AC      PAC    Q-Stat    Prob
  1    0.465    0.465    11.059   0.001
  2    0.065   -0.194    11.277   0.004
  3   -0.129   -0.099    12.160   0.007
  4   -0.048    0.101    12.288   0.015
  5   -0.012   -0.053    12.296   0.031
  6    0.015    0.015    12.309   0.055
  7   -0.016   -0.027    12.324   0.090
  8   -0.115   -0.133    13.123   0.108
  9   -0.081    0.056    13.528   0.140
 10    0.124    0.194    14.501   0.151
 11    0.194    0.001    16.944   0.110
 12    0.098   -0.027    17.589   0.129
 13    0.063    0.120    17.861   0.163
 14    0.034   -0.017    17.941   0.209
 15    0.027    0.016    17.994   0.263
 16    0.028    0.034    18.051   0.321
 17    0.026   -0.026    18.105   0.382
 18   -0.152   -0.193    19.959   0.335
 19   -0.029    0.279    20.027   0.393
 20    0.101    0.016    20.902   0.403

Figure 14.7 Correlogram of First Difference of Consumption


The Box-Jenkins methodology has been popular primarily among forecasters who claimed
better performance than simultaneous equations models based upon economic theory.
Box-Jenkins models are general enough to allow for nonstationarity and can handle seasonality.
However, the Box-Jenkins models suffer from the fact that they are devoid of economic theory
and as such they are not designed to test economic hypotheses, or provide estimates of key
elasticity parameters. As a consequence, this method cannot be used for simulating the effects
of a tax change or a Federal Reserve policy change. One lesson that economists learned from
the Box-Jenkins methodology is that they have to take a hard look at the time-series properties
of their variables and properly specify the dynamics of their economic models. Another popular
forecasting technique in economics is the Vector Autoregression (VAR) methodology proposed
by Sims (1980). This will be briefly discussed next.


14.4  Vector  Autoregression 

Sims  (1980)  criticized  the  simultaneous  equation  literature  for  the  ad  hoc  restrictions  needed 
for  identification  and  for  the  ad  hoc  classification  of  exogenous  and  endogenous  variables  in  the 
system,  see  Chapter  11.  Instead,  Sims  (1980)  suggested  Vector  Autoregression  (VAR)  models  for 
forecasting  macro  time-series.  VAR  assumes  that  all  the  variables  are  endogenous.  For  example, 
consider  the  following  three  macro-series:  money  supply,  interest  rate,  and  output.  VAR  models 
this  vector  of  three  endogenous  variables  as  an  autoregressive  function  of  their  lagged  values. 
VAR  models  can  include  some  exogenous  variables  like  trends  and  seasonal  dummies,  but  the 
whole  point  is  that  it  does  not  have  to  classify  variables  as  endogenous  or  exogenous.  If  we 
allow  5  lags  on  each  endogenous  variable,  each  equation  will  have  16  parameters  to  estimate 
if  we  include  a  constant.  For  example,  the  money  supply  equation  will  be  a  function  of  5  lags 
on  money,  5  lags  on  the  interest  rate  and  5  lags  on  output.  Since  the  parameters  are  different 
for each equation the total number of parameters in this unrestricted VAR is 3 × 16 = 48
parameters. This degrees of freedom problem becomes more serious as the number of lags m
and number of equations g increase. In fact, the number of parameters to be estimated becomes
$g + mg^2$. With small samples, individual parameters may not be estimated precisely. So, only
simple VAR models can be considered for a short sample. Since this system of equations has
the same set of variables in each equation, SUR on the system is equivalent to OLS on each
equation, see Chapter 10. Under normality of the disturbances, MLE as well as Likelihood Ratio
tests can be performed. One important application of LR tests in the context of VAR is its use
in determining the choice of lags to be used. In this case, one obtains the log-likelihood for the
restricted model with m lags and the unrestricted model with q > m lags. This LR test will be
asymptotically distributed as $\chi^2_{(q-m)g^2}$. Once again, the sample size T should be large enough
to estimate the large number of parameters $(qg^2 + g)$ for the unrestricted model.
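
The parameter counts quoted above follow from simple arithmetic, which the sketch below makes explicit. The function names are hypothetical; the formulas are the ones in the text ($g + mg^2$ parameters with a constant in each equation, and $(q - m)g^2$ restrictions when testing m against q > m lags).

```python
def var_param_count(g, m, constant=True):
    """Number of coefficients in an unrestricted VAR with g equations and m lags.

    Each equation has m lags on each of the g variables, plus a constant if included,
    so the total is g * (g * m + 1) = g + m * g^2 when a constant is present.
    """
    per_equation = g * m + (1 if constant else 0)
    return g * per_equation


def lr_lag_test_df(g, m, q):
    """Degrees of freedom of the LR test of m lags (restricted) against q > m lags."""
    if q <= m:
        raise ValueError("the unrestricted model must have more lags: q > m")
    return (q - m) * g * g


# The money supply, interest rate and output example with 5 lags
print(var_param_count(g=3, m=5))        # 3 x 16 = 48
print(lr_lag_test_df(g=3, m=2, q=5))    # restrictions when testing 2 lags against 5
```

Doubling the number of variables roughly quadruples the parameter count, which is the degrees of freedom problem the text refers to.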

One  can  of  course  impose  restrictions  to  reduce  the  number  of  parameters  to  be  estimated, 
but  this  reintroduces  the  problem  of  ad  hoc  restrictions  which  VAR  was  supposed  to  cure  in 
the  first  place.  Bayesian  VAR  procedures  claim  success  with  forecasting,  see  Litterman  (1986), 
but  again  these  models  have  been  criticized  because  they  are  devoid  of  economic  theory. 

VAR models have also been used to test the hypothesis that some variables do not Granger
cause some other variables.1 For a two-equation VAR, as long as this VAR is correctly specified
and no variables are omitted, one can test, for example, that $y_1$ does not Granger cause $y_2$. This
hypothesis cannot be rejected if all the m lagged values of $y_1$ are insignificant in the equation
for $y_2$. This is a simple F-test for the joint significance of the lagged coefficients of $y_1$ in the
$y_2$ equation. This is asymptotically distributed as $F_{m, T-(2m+1)}$. The problem with the Granger
test for non-causality is that it may be sensitive to the number of lags m, see Gujarati (1995).
For an extensive analysis of nonstationary VAR models as well as testing and estimation of
cointegrating relationships in VAR models, see Hamilton (1994) and Lütkepohl (2001).


14.5  Unit  Roots 

If $x_t = x_{t-1} + u_t$ where $u_t$ is IID$(0, \sigma^2)$, then $x_t$ is a random walk. Some stock market analysts
believe that stock prices follow a random walk, i.e., the price of a stock today is equal to its
price yesterday plus a random shock. This is a nonstationary time-series. Any shock to the price
of this stock is permanent and does not die out as it would in a stationary AR(1) process. In fact,
if the initial price of the stock is $x_0 = \mu$, then

$x_1 = \mu + u_1$, $x_2 = \mu + u_1 + u_2$, $\ldots$, and $x_t = \mu + \sum_{j=1}^{t} u_j$

with $E(x_t) = \mu$ and var$(x_t) = t\sigma^2$ since $u_t \sim$ IID$(0, \sigma^2)$. Therefore, the variance of $x_t$ is dependent
on $t$ and $x_t$ is not covariance-stationary. In fact, as $t \to \infty$, so does var$(x_t)$. However, first
differencing $x_t$ we get $u_t$ which is stationary. Figure 14.8 plots the graph of a random walk
of size $T = 250$ generated as $x_t = x_{t-1} + \epsilon_t$ with $\epsilon_t \sim$ IIN$(0, 4)$. Figure 14.9 shows that the
autocorrelation function of this random walk process is persistent as the lag $s$ increases. Note that a
random walk is an AR(1) model $x_t = \rho x_{t-1} + u_t$ with $\rho = 1$. Therefore, a test for nonstationarity
is a test for $\rho = 1$ or a test for a unit root.
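The result var$(x_t) = t\sigma^2$ can be checked by simulation. The sketch below is illustrative only: it assumes normal shocks with $\sigma = 2$ (as in the IIN(0,4) example), a starting value of zero, and an arbitrary number of replications.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T, reps = 2.0, 250, 5000

# Each row is one random walk x_t = x_{t-1} + u_t started at x_0 = 0.
shocks = rng.normal(0.0, sigma, size=(reps, T))
paths = np.cumsum(shocks, axis=1)

# Cross-sectional variance at each date t = 1, ..., T.
var_by_t = paths.var(axis=0)
theory = sigma ** 2 * np.arange(1, T + 1)   # t * sigma^2
```

Plotting `var_by_t` against `theory` should show both growing roughly linearly in $t$, whereas the variance of the first difference stays near $\sigma^2$.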

Using the lag operator $L$ we can write the random walk as $(1 - L)x_t = u_t$ and in general, any
autoregressive model in $x_t$ can be written as $A(L)x_t = u_t$ where $A(L)$ is a polynomial in $L$. If
$A(L)$ has $(1 - L)$ as one of its factors, i.e., if $A(z) = 0$ has a root at $z = 1$, then $x_t$ has a unit root.
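As a small numerical illustration, the roots of a lag polynomial can be computed directly. The AR(2) polynomial below is a made-up example, $A(L) = 1 - 1.5L + 0.5L^2 = (1 - L)(1 - 0.5L)$, chosen so that one factor is $(1 - L)$.

```python
import numpy as np

# Hypothetical lag polynomial A(L) = 1 - 1.5 L + 0.5 L^2.
# np.roots expects coefficients ordered from the highest power down.
coeffs_desc = [0.5, -1.5, 1.0]
roots = np.roots(coeffs_desc)

# A root at z = 1 means (1 - L) is a factor, i.e., x_t has a unit root.
has_unit_root = bool(np.any(np.isclose(roots, 1.0)))
```

Here the second root, $z = 2$, lies outside the unit circle, so once the series is differenced the remaining AR(1) part with coefficient 0.5 is stationary.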




Figure  14.8  Random  Walk  Process 

Subtracting $x_{t-1}$ from both sides of the AR(1) model we get

$\Delta x_t = (\rho - 1)x_{t-1} + u_t = \delta x_{t-1} + u_t$  (14.3)

where $\delta = \rho - 1$ and $\Delta x_t = x_t - x_{t-1}$ is the first-difference of $x_t$. A test for $H_0$; $\rho = 1$ can be
obtained by regressing $\Delta x_t$ on $x_{t-1}$ and testing that $H_0$; $\delta = 0$. Since $u_t$ is stationary then if
$\delta = 0$, $\Delta x_t = u_t$ and $x_t$ is difference stationary, i.e., it becomes stationary after differencing it
once. In this case, the original undifferenced series $x_t$ is said to be integrated of order 1 or $I(1)$.
If we need to difference $x_t$ twice to make it stationary, then $x_t$ is $I(2)$. A stationary process
is by definition $I(0)$. Dickey and Fuller (1979) showed that the usual regression t-statistic for
$H_0$; $\delta = 0$ from (14.3) does not have a t-distribution under $H_0$. In fact, this t-statistic has a
non-standard distribution, see Bierens (2001) for a simple derivation of these results. Dickey and
Fuller tabulated the critical values of the t-statistic $= (\hat\rho - 1)/\text{s.e.}(\hat\rho) = \hat\delta/\text{s.e.}(\hat\delta)$ using Monte
Carlo experiments. These tables have been extended by MacKinnon (1991). If $|t|$ exceeds the
critical values, we reject $H_0$ that $\rho = 1$ which also means that we do not reject the hypothesis
of stationarity of the time-series. Non-rejection of $H_0$; $\rho = 1$ means that we do not reject the
presence of a unit root and hence the nonstationarity of the time-series. Note that non-rejection
of $H_0$ may also be a non-rejection of $\rho = 0.99$. More formally stated, a weakness of unit root
tests in general is that they have low power discriminating between a unit root process and
a borderline stationary process. In practice, the Dickey-Fuller test has been applied in the
following three forms:

$\Delta x_t = \delta x_{t-1} + u_t$  (14.4)

$\Delta x_t = \alpha + \delta x_{t-1} + u_t$  (14.5)

$\Delta x_t = \alpha + \beta t + \delta x_{t-1} + u_t$  (14.6)

where $t$ is a time trend. The null hypothesis of the existence of a unit root is $H_0$; $\delta = 0$. This is
the same for (14.4), (14.5) and (14.6), but the critical values for the corresponding t-statistics
are different in each case. Standard time-series software like EViews gives the proper critical
values for the Dickey-Fuller statistic. For alternative unit root tests, see Phillips and Perron
(1988) and Bierens and Guo (1993). In practice, one should run (14.6) if the series is trended
with drift and (14.5) if it is trended without drift. Not including a constant or trend as in (14.4)
is unlikely for economic data. The Box-Jenkins approach differences the series and looks at
the sample correlogram of the differenced series. The Dickey-Fuller test is a more formal test
for the existence of a unit root. Maddala (1992, p. 585) warns the reader to perform both the
visual inspection and the unit root test before deciding on whether the time-series process is
nonstationary.

Sample: 1 250
Included observations: 250

  Lag       AC       PAC     Q-Stat    Prob
    1     0.980     0.980    242.76    0.000
    2     0.959    -0.003    476.56    0.000
    3     0.940     0.004    701.83    0.000
    4     0.920    -0.013    918.61    0.000
    5     0.899    -0.044    1126.4    0.000
    6     0.876    -0.053    1324.6    0.000
    7     0.855     0.028    1514.2    0.000
    8     0.837     0.067    1696.7    0.000
    9     0.821     0.032    1872.8    0.000
   10     0.804    -0.006    2042.6    0.000
   11     0.788    -0.007    2206.3    0.000
   12     0.774     0.030    2364.8    0.000

Figure 14.9 Correlogram of a Random Walk Process
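To see why the usual t-tables cannot be used, one can replicate a small version of the Dickey-Fuller Monte Carlo experiment. The sketch below is an illustration, not the original design: the helper name `df_tstat`, the sample size of 250, the 2000 replications and the normal shocks are all our assumptions.

```python
import numpy as np

def df_tstat(x, trend="c"):
    """t-statistic on x_{t-1} in (14.4) ('n'), (14.5) ('c') or (14.6) ('ct')."""
    dx = np.diff(x)
    n = len(dx)
    cols = [x[:-1]]                                    # x_{t-1}
    if trend in ("c", "ct"):
        cols.append(np.ones(n))                        # constant
    if trend == "ct":
        cols.append(np.arange(1, n + 1, dtype=float))  # time trend
    X = np.column_stack(cols)
    b, *_ = np.linalg.lstsq(X, dx, rcond=None)
    e = dx - X @ b
    s2 = e @ e / (n - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])
    return b[0] / se

# Simulate the t-statistic under the null: every series is a random walk.
rng = np.random.default_rng(42)
stats = np.array([df_tstat(np.cumsum(rng.standard_normal(250)), trend="c")
                  for _ in range(2000)])
q05 = np.quantile(stats, 0.05)   # simulated 5% critical value for (14.5)
```

The simulated 5% quantile should lie well to the left of the one-sided normal value of $-1.645$, which is why applying standard critical values would reject a unit root far too often.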

If the disturbance term $u_t$ follows a stationary AR(1) process, then the augmented Dickey-Fuller
test runs the following modified version of (14.6) by including one additional regressor,
$\Delta x_{t-1}$:

$\Delta x_t = \alpha + \beta t + \delta x_{t-1} + \lambda \Delta x_{t-1} + \epsilon_t$  (14.7)

In this case, the t-statistic for $\delta = 0$ is a unit root test allowing for first-order serial correlation.
This augmented Dickey-Fuller test in (14.7) has the same asymptotic distribution as the corresponding
Dickey-Fuller test in (14.6) and the same critical values can be used. Similarly, if $u_t$
follows a stationary AR($p$) process, this amounts to adding $p$ extra regressors in (14.6) consisting
of $\Delta x_{t-1}, \Delta x_{t-2}, \ldots, \Delta x_{t-p}$ and testing that the coefficient of $x_{t-1}$ is zero. In practice, one
does not know the process generating the serial correlation in $u_t$ and the general practice is to
include as many lags of $\Delta x_t$ as is necessary to render the $\epsilon_t$ term in (14.7) serially uncorrelated.
More lags may be needed if the disturbance term contains Moving Average terms, since a MA
term can be thought of as an infinite autoregressive process, see Ng and Perron (1995) for an
extensive Monte Carlo study on the selection of the truncation lag. Two other important complications
when doing unit root tests are: (i) structural breaks in the time-series, like the oil embargo
of 1973, tend to bias the standard unit root tests against rejecting the null hypothesis of a unit
root, see Perron (1989). (ii) Seasonally adjusted data also tend to bias the standard unit root
tests against rejecting the null hypothesis of a unit root, see Ghysels and Perron (1992). For
this reason, Davidson and MacKinnon (1993, p. 714) suggest using seasonally unadjusted data
whenever available.
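The augmented regression can be sketched as follows. This is a minimal illustration: the function name `adf_tstat` and the simulated series are hypothetical, the lag length $p$ is supplied by the user rather than chosen by an information criterion, and the resulting statistic must still be compared with Dickey-Fuller critical values, not with the standard t-table.

```python
import numpy as np

def adf_tstat(x, p=1):
    """t-statistic on x_{t-1} in (14.7), generalized to p lagged differences."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)
    y = dx[p:]                                        # Delta x_t
    n = len(y)
    cols = [x[p:-1],                                  # x_{t-1}
            np.ones(n),                               # constant
            np.arange(1, n + 1, dtype=float)]         # time trend
    cols += [dx[p - j:len(dx) - j] for j in range(1, p + 1)]  # Delta x_{t-j}
    X = np.column_stack(cols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    s2 = e @ e / (n - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])
    return b[0] / se

# Hypothetical series: a stationary AR(1) and a random walk.
rng = np.random.default_rng(7)
shocks = rng.standard_normal(500)
x_ar = np.zeros(500)
for t in range(1, 500):
    x_ar[t] = 0.5 * x_ar[t - 1] + shocks[t]
x_rw = np.cumsum(rng.standard_normal(500))

t_ar = adf_tstat(x_ar, p=1)   # strongly negative for the stationary series
t_rw = adf_tstat(x_rw, p=1)   # to be judged against Dickey-Fuller tables
```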

For the trended consumption series with drift, the Augmented Dickey-Fuller test using EViews
yields the following regression:

$\Delta C_t = 665.60 + 30.57\,t - 0.072\,C_{t-1} + 0.449\,\Delta C_{t-1} +$ residuals  (14.8)
          (1.80)    (1.60)    (1.42)          (3.17)

where the numbers in parentheses are the usually reported t-statistics. The null hypothesis
is that the coefficient of $C_{t-1}$ in this regression is zero. Table 14.1 gives the Dickey-Fuller
t-statistic (-1.42) and the corresponding 5% critical value (-3.508) tabulated by MacKinnon
(1996). Note that @TREND(1959) is the time-trend starting at 1959. The Schwarz criterion
found the optimal number of lags of $\Delta C_t$ to be included in this regression is one. Since the
p-value is 0.84, we do not reject the null hypothesis of the existence of a unit root. We conclude
that $C_t$ is nonstationary. This confirms our finding from the sample correlogram of $C_t$ given in
Figure 14.2.


Table 14.1 Dickey-Fuller Test

Null Hypothesis: CONSUMP has a unit root
Exogenous: Constant, Linear Trend
Lag Length: 1 (Automatic based on SIC, MAXLAG=10)

                                              t-Statistic     Prob.*
Augmented Dickey-Fuller test statistic         -1.418937      0.8424
Test critical values:     1% level             -4.165756
                          5% level             -3.508508
                          10% level            -3.184230

* MacKinnon (1996) one-sided p-values.

Augmented Dickey-Fuller Test Equation
Dependent Variable: D(CONSUMP)
Method: Least Squares
Sample (adjusted): 1961 2007
Included observations: 47 after adjustments

                     Coefficient    Std. Error    t-Statistic    Prob.
CONSUMP(-1)           -0.072427      0.051043      -1.418937     0.1631
D(CONSUMP(-1))         0.448898      0.141516       3.172064     0.0028
C                      665.6031      370.5970       1.796030     0.0795
@TREND(1959)           30.56963      19.13925       1.597221     0.1175

R-squared              0.291770     Mean dependent var       393.2340
Adjusted R-squared     0.242359     S.D. dependent var       254.9362
S.E. of regression     221.9031     Akaike info criterion    13.72362
Sum squared resid      2117363.     Schwarz criterion        13.88108
Log likelihood        -318.5052     Hannan-Quinn criter.     13.78288
F-statistic            5.904911     Durbin-Watson stat       1.841198
Prob(F-statistic)      0.001816

One can check whether the first-differenced series is stationary by performing a unit root test
on the first-differenced model. Let $\tilde C_t = \Delta C_t$, then run the following regression:

$\Delta \tilde C_t = 213.77 - 0.533\,\tilde C_{t-1} +$ residuals  (14.9)
            (3.59)    (4.13)

The coefficient of $\tilde C_{t-1}$ has a t-statistic of -4.13 which is smaller than the 5% critical value of
-2.925. In other words, we reject the null hypothesis of unit root for the first-differenced series
$\Delta C_t$. The same conclusion would have been reached if a linear trend was included besides the
constant. We conclude that $C_t$ is $I(1)$.

So far, all tests for unit root have the hypothesis of nonstationarity as the null with the
alternative being that the series is stationary. Two tests with stationarity as the null
and nonstationarity as the alternative are given by Kwiatkowski et al. (1992) and Leybourne and
McCabe (1994). The first test, known as KPSS, is an analog of the Phillips-Perron test whereas
the Leybourne-McCabe test is an analog of the augmented Dickey-Fuller test. Reversing the null
may lead to confirmation of stationarity or nonstationarity or may yield conflicting decisions.


14.6  Trend  Stationary  Versus  Difference  Stationary 

Many macroeconomic time-series that are trending upwards have been characterized as either

Trend Stationary: $x_t = \alpha + \beta t + u_t$  (14.10)

or

Difference Stationary: $x_t = \gamma + x_{t-1} + u_t$  (14.11)

where $u_t$ is stationary. The first model (14.10) says that the macro-series is stationary except for
a deterministic trend, with $E(x_t) = \alpha + \beta t$ which varies with $t$. In contrast, the second model (14.11)
says that the macro-series is a random walk with drift. The drift parameter $\gamma$ in (14.11) plays
the same role as the $\beta$ parameter in (14.10), since both cause $x_t$ to trend upwards over time.
Model (14.10) is consistent with economists introducing a time trend in the regression. This
has the same effect as detrending each variable in the regression rendering it stationary, see the
Frisch-Waugh-Lovell Theorem in Chapter 7. This detrending is valid only if model (14.10) is
true for every series in the regression. Model (14.11) on the other hand, requires differencing to
obtain a stationary series. Detrending and differencing are two completely different remedies.
What is valid for one model is not valid for the other. The choice between (14.10) and (14.11) is
based on a test for the existence of a unit root. Essential reading on these two models is Nelson
and Plosser (1982) and Stock and Watson (1988). Nelson and Plosser applied the Dickey-Fuller
test to a wide range of historical macro time-series for the U.S. economy and found that all of
these series were difference stationary, with the exception of the unemployment rate. Plosser and
Schwert (1978) argued that for most economic macro time-series, it is best to difference the data
rather than work with levels. The reasoning is that if these series are difference stationary and
we run regressions in levels, the usual properties of our estimators as well as the distributions of
the associated test statistics are invalid. On the other hand, if the true model is a regression in
levels with the data series being trend stationary, differencing the model will produce a Moving
Average error term and at worst, ignoring it will lead to loss in efficiency. It is important to
emphasize that for nonstationary variables, the standard asymptotic theory does not apply, see
problems 6 and 7, and that t and F-statistics obtained from regressions using these variables
may have non-standard distributions, see Durlauf and Phillips (1988).
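The difference between the two remedies can be made explicit with a short derivation. This is a sketch under two assumptions not stated in the text: $u_t$ is white noise and $x_0$ is a fixed starting value. Differencing the trend stationary model (14.10) gives

$$\Delta x_t = (\alpha + \beta t + u_t) - (\alpha + \beta (t-1) + u_{t-1}) = \beta + u_t - u_{t-1},$$

so the differenced series has a Moving Average error $u_t - u_{t-1}$, which is the case discussed above. Solving the difference stationary model (14.11) recursively gives

$$x_t = x_0 + \gamma t + \sum_{j=1}^{t} u_j,$$

so detrending removes $\gamma t$ but leaves the accumulated shocks $\sum_{j=1}^{t} u_j$, whose variance grows with $t$. Detrending a difference stationary series therefore does not render it stationary.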

Granger and Newbold (1974) demonstrated some of the problems associated with regressing
nonstationary time-series on each other. In fact, they showed that if $x_t$ and $y_t$ are independent
random walks, then one should expect to find no evidence of a relationship when one regresses
$y_t$ on $x_t$. In other words, the estimate of $\beta$ in the regression $y_t = \alpha + \beta x_t + u_t$ should be near zero
and its associated t-statistic insignificant. In fact, for a sample size of 50 and 100 replications,
Granger and Newbold found $|t| < 2$ on only 23 occasions, $2 < |t| < 4$ on 24 occasions, and $|t| > 4$
on 53 occasions. Granger and Newbold (1974) called this phenomenon spurious regression since
it finds a significant relationship between the two time-series when none exists. Hence, one
should be cautious when running time-series regressions involving unit root processes. High $R^2$
and significant t-statistics from OLS regressions may be hiding nonsense results. Phillips (1986)
studied the asymptotic properties of the least squares spurious regression model and confirmed
these simulation results. In fact, Phillips showed that the t-statistic for $H_0$; $\beta = 0$ converges
in probability to $\infty$ as $T \to \infty$. This means that the t-statistic will reject $H_0$; $\beta = 0$ with
probability 1 as $T \to \infty$. If both $x_t$ and $y_t$ are independent trend stationary series generated as
described in (14.10), then the $R^2$ of the regression of $y_t$ on $x_t$ will tend to one as $T \to \infty$, see
Davidson and MacKinnon (1993, p. 671). For a summary of several extensions of these results,
see Granger (2001).
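The Granger and Newbold experiment is easy to replicate in spirit. The sketch below is not their original design: the seed, the 1000 replications, the normal shocks and the helper name `ols_t_slope` are our assumptions. It regresses one independent random walk on another for $T = 50$ and records how often $|t| > 2$, then repeats the exercise with independent white-noise series as a control.

```python
import numpy as np

def ols_t_slope(y, x):
    """t-statistic on the slope in an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    s2 = e @ e / (len(y) - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return b[1] / se

rng = np.random.default_rng(1974)
T, reps = 50, 1000

# Independent random walks: cumulative sums of independent shocks.
t_rw = np.array([ols_t_slope(np.cumsum(rng.standard_normal(T)),
                             np.cumsum(rng.standard_normal(T)))
                 for _ in range(reps)])

# Control: independent stationary (white-noise) series.
t_wn = np.array([ols_t_slope(rng.standard_normal(T), rng.standard_normal(T))
                 for _ in range(reps)])

reject_rw = np.mean(np.abs(t_rw) > 2)   # share of "significant" slopes
reject_wn = np.mean(np.abs(t_wn) > 2)
```

The rejection share for the random walks should be far above the nominal level, while the white-noise control stays close to it.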


14.7  Cointegration 


Let us continue with our consumption-income example. In Chapter 5, we regressed $C_t$ on $Y_t$
and obtained

$C_t = -1343.31 + 0.979\,Y_t +$ residuals  (14.12)
       (219.56)    (0.011)

with $R^2 = 0.994$ and D.W. $= 0.18$. We have shown that $C_t$ and $Y_t$ are nonstationary series and
that both are $I(1)$, see also problem 3. The regression in (14.12) could be a spurious regression
owing to the fact that we regressed a nonstationary series on another. This invalidates the t
and F-statistics of regression (14.12). Since both $C_t$ and $Y_t$ are integrated of the same order,
and Figure 14.1 shows that they are trending upwards together, these random walks may be in
unison. This is the idea behind cointegration. $C_t$ and $Y_t$ are cointegrated if there exists a linear
combination of $C_t$ and $Y_t$ that yields a stationary series. More formally, if $C_t$ and $Y_t$ are both
$I(1)$ but there exists a linear combination $C_t - \alpha - \beta Y_t = u_t$ which is $I(0)$, then $C_t$ and $Y_t$ are
cointegrated and $\beta$ is the cointegrating parameter. This idea can be extended to a vector of more
than two time-series. This vector is cointegrated if the components of this vector have a unit
root and there exists a linear combination of this vector that is stationary. Such a cointegrating
relationship can be interpreted as a stable long-run relationship between the components
of this time-series vector. Economic examples of long-run relationships include the quantity
theory of money, purchasing power parity and the permanent income theory of consumption.
The important point to emphasize here is that differencing these nonstationary time-series destroys
potentially valuable information about the long-run relationship between these economic
variables. The theory of cointegration tries to estimate this long-run relationship using the
nonstationary series themselves, rather than their first differences. In order to explain this, we state
(without proof) one of the implications of the Granger Representation Theorem, namely, that
a set of cointegrated variables will have an Error-Correction Model (ECM) representation. Let
us illustrate with an example.


A  Cointegration  Example 


This is based on Engle and Granger (1987). Assume that $C_t$ and $Y_t$ for $t = 1, 2, \ldots, T$ are $I(1)$
processes generated as follows:

$C_t - \beta Y_t = u_t$ with $u_t = \rho u_{t-1} + \epsilon_t$ and $|\rho| < 1$  (14.13)

$C_t - \alpha Y_t = \nu_t$ with $\nu_t = \nu_{t-1} + \eta_t$ and $\alpha \neq \beta$  (14.14)

In other words, $u_t$ follows a stationary AR(1) process, while $\nu_t$ follows a random walk. Suppose
that $(\epsilon_t, \eta_t)'$ are independent bivariate normal random variables with mean zero and variance
$\Sigma = [\sigma_{ij}]$ for $i, j = 1, 2$. First, we obtain the reduced form representation of $Y_t$ and $C_t$ in terms
of $u_t$ and $\nu_t$. This is given by

$C_t = \frac{\alpha}{(\alpha - \beta)}\,u_t - \frac{\beta}{(\alpha - \beta)}\,\nu_t$  (14.15)

$Y_t = \frac{1}{(\alpha - \beta)}\,u_t - \frac{1}{(\alpha - \beta)}\,\nu_t$  (14.16)

Since $u_t$ is $I(0)$ and $\nu_t$ is $I(1)$, we conclude from (14.15) and (14.16) that $C_t$ and $Y_t$ are in
fact $I(1)$ series. In terms of the usual order condition for identification considered in Chapter
11, the system of equations given by (14.13) and (14.14) is not identified because there are
no exclusion restrictions on either equation. However, if we take a linear combination of the
two structural equations given in (14.13) and (14.14), the disturbance of the resulting linear
combination is neither a stationary AR(1) process nor a random walk. Hence, both (14.13) and
(14.14) are identified. Note that if $\rho = 1$, then $u_t$ is a random walk and the linear combination
of $u_t$ and $\nu_t$ is also a random walk. In this case, neither (14.13) nor (14.14) is identified.


In the Engle-Granger terminology, $C_t - \beta Y_t$ is the cointegrating relationship and $(1, -\beta)$ is the
cointegrating vector. This cointegrating relationship is unique. The proof is by contradiction.
Assume there is another cointegrating relationship $C_t - \gamma Y_t$ that is $I(0)$, then the difference
between the two cointegrating relationships yields $(\gamma - \beta)Y_t$. This is also $I(0)$. This can only
happen for every value of $Y_t$, which is $I(1)$, if and only if $\beta = \gamma$.

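Difference both equations in (14.13) and (14.14) and write both differenced equations as a
system of two equations in $(\Delta C_t, \Delta Y_t)'$, one gets:

$\begin{bmatrix} 1 & -\beta \\ 1 & -\alpha \end{bmatrix} \begin{bmatrix} \Delta C_t \\ \Delta Y_t \end{bmatrix} = \begin{bmatrix} \Delta u_t \\ \Delta \nu_t \end{bmatrix} = \begin{bmatrix} \epsilon_t + (\rho - 1)C_{t-1} - \beta(\rho - 1)Y_{t-1} \\ \eta_t \end{bmatrix}$  (14.17)

where the second equality is obtained by replacing $\Delta \nu_t$ by $\eta_t$, $\Delta u_t$ by $(\rho - 1)u_{t-1} + \epsilon_t$, and
substituting for $u_{t-1}$ its value $(C_{t-1} - \beta Y_{t-1})$. Premultiplying (14.17) by the inverse of the
first matrix, one can show, see problem 9, that the resulting solution is the following VAR
model:

$\begin{bmatrix} \Delta C_t \\ \Delta Y_t \end{bmatrix} = \frac{1}{(\beta - \alpha)} \begin{bmatrix} -\alpha(\rho - 1) & \alpha\beta(\rho - 1) \\ -(\rho - 1) & \beta(\rho - 1) \end{bmatrix} \begin{bmatrix} C_{t-1} \\ Y_{t-1} \end{bmatrix} + \begin{bmatrix} h_t \\ g_t \end{bmatrix}$  (14.18)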


where $h_t$ and $g_t$ are linear combinations of $\epsilon_t$ and $\eta_t$. Note that if $\rho = 1$, then the level variables
$C_{t-1}$ and $Y_{t-1}$ drop from the VAR equations. Let $Z_t = C_t - \beta Y_t$ and define $\delta = (\rho - 1)/(\beta - \alpha)$.
Then the VAR representation in (14.18) can be written as follows:

$\Delta C_t = -\alpha\delta Z_{t-1} + h_t$  (14.19)

$\Delta Y_t = -\delta Z_{t-1} + g_t$  (14.20)

This is the Error-Correction Model (ECM) representation of the original model. $Z_{t-1}$ is the
error correction term. It represents a disequilibrium term showing the departure from long-run
equilibrium, see section 6.4. Note that if $\rho = 1$, then $\delta = 0$ and $Z_{t-1}$ drops from both ECM
equations. As Banerjee et al. (1993, p. 139) explain, this ECM representation is a noteworthy
“...contribution to resolving, or synthesizing, the debate between time-series analysts and those
favoring econometric methods.” The former considered only differenced time-series that can be
legitimately assumed stationary, while the latter focused on equilibrium relationships expressed
in levels. The former wiped out important long-run relationships by first differencing them,
while the latter ignored the spurious regression problem. In contrast, the ECM allows the use
of first differences and levels from the cointegrating relationship. For more details, see Banerjee
et al. (1993). A simple two-step procedure for estimating cointegrating relationships is given by
Engle and Granger (1987). In the first step, the OLS estimator of $\beta$ is obtained by regressing $C_t$
on $Y_t$. This can be shown to be superconsistent, i.e., plim $T(\hat\beta_{OLS} - \beta) \to 0$ as $T \to \infty$. Using
$\hat\beta_{OLS}$ one obtains $\hat Z_t = C_t - \hat\beta_{OLS} Y_t$. In the second step, using $\hat Z_{t-1}$ rather than $Z_{t-1}$, apply
OLS to estimate the ECM in (14.19) and (14.20). Extensive Monte Carlo experiments have
been conducted by Banerjee et al. (1993) to investigate the bias of $\hat\beta$ in small samples. This is
pursued further in problem 9. An alternative estimation procedure is the maximum likelihood
approach suggested by Johansen (1988). This is beyond the scope of this book. See Dolado et
al. (2001) for a lucid summary of the cointegration literature.
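The two-step procedure can be tried on data simulated from the example in (14.13) and (14.14). The sketch below is illustrative and rests on several assumptions of ours: the parameter values ($\beta = 1$, $\alpha = 2$, $\rho = 0.5$), independent standard normal shocks, a long sample of $T = 5000$, and a first-step regression without a constant because the simulated model has none (regression (14.12) in the text does include one).

```python
import numpy as np

rng = np.random.default_rng(1987)
T, beta, alpha, rho = 5000, 1.0, 2.0, 0.5
eps, eta = rng.standard_normal(T), rng.standard_normal(T)

u = np.zeros(T)
u[0] = eps[0]
for t in range(1, T):
    u[t] = rho * u[t - 1] + eps[t]        # stationary AR(1), as in (14.13)
nu = np.cumsum(eta)                        # random walk, as in (14.14)

# Reduced form (14.15)-(14.16).
C = (alpha * u - beta * nu) / (alpha - beta)
Y = (u - nu) / (alpha - beta)

# Step 1: OLS of C on Y gives the estimated cointegrating parameter.
beta_hat = (Y @ C) / (Y @ Y)
Z_hat = C - beta_hat * Y                   # estimated error correction term

# Step 2: OLS of each first difference on the lagged error correction term.
z_lag = Z_hat[:-1]
coef_C = (z_lag @ np.diff(C)) / (z_lag @ z_lag)   # estimates -alpha * delta
coef_Y = (z_lag @ np.diff(Y)) / (z_lag @ z_lag)   # estimates -delta

delta = (rho - 1) / (beta - alpha)         # 0.5 for these parameter values
```

With these values the ECM coefficients in (14.19) and (14.20) are $-\alpha\delta = -1$ and $-\delta = -0.5$, so `coef_C` and `coef_Y` can be compared with them directly.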

A formal test for cointegration is given by Engle and Granger (1987) who suggest running
regression (14.12) and testing that the residuals do not have a unit root. In other words, run a
Dickey-Fuller test or its augmented version on the resulting residuals from (14.12). In fact, if
$C_t$ and $Y_t$ are not cointegrated, then any linear combination of them would be nonstationary
including the residuals of (14.12). Since these tests are based on residuals, their asymptotic
distributions are not the same as those of the corresponding ordinary unit root tests. Asymptotic
critical values for these tests can be found in Davidson and MacKinnon (1993, p. 722). For our
consumption regression the following Dickey-Fuller test is obtained on the residuals:

$\Delta \hat u_t = -1.111 - 0.094\,\hat u_{t-1} +$ residuals  (14.21)
           (0.04)    (1.50)

The Davidson and MacKinnon (1993) asymptotic 5% critical value for this t-statistic is -2.92
and its p-value is 0.52. Therefore, we cannot reject the hypothesis that $\hat u_t$ is nonstationary.
We have also included a trend and one lag of the first-differenced residuals. The resulting
augmented Dickey-Fuller test did not reject the existence of a unit root. Therefore, $C_t$ and
$Y_t$ are not cointegrated. This suggests that the relationship estimated in (14.12) is spurious.

Table 14.2 Johansen Cointegration Test

Sample (adjusted): 1961 2007
Included observations: 47 after adjustments
Trend assumption: Linear deterministic trend
Series: CONSUMP Y
Lags interval (in first differences): 1 to 1

Unrestricted Cointegration Rank Test (Trace)

Hypothesized                       Trace          0.05
No. of CE(s)     Eigenvalue      Statistic    Critical Value    Prob.**
None              0.120482        6.322057       15.49471        0.6575
At most 1         0.006112        0.288141       3.841466        0.5914

Trace test indicates no cointegration at the 0.05 level
* denotes rejection of the hypothesis at the 0.05 level
** MacKinnon-Haug-Michelis (1999) p-values

Unrestricted Cointegration Rank Test (Maximum Eigenvalue)

Hypothesized                     Max-Eigen        0.05
No. of CE(s)     Eigenvalue      Statistic    Critical Value    Prob.**
None              0.120482        6.033916       14.26460        0.6089
At most 1         0.006112        0.288141       3.841466        0.5914

Max-eigenvalue test indicates no cointegration at the 0.05 level
* denotes rejection of the hypothesis at the 0.05 level
** MacKinnon-Haug-Michelis (1999) p-values

Regressing an $I(1)$ series on another leads to spurious results unless they are cointegrated. Of
course, other $I(1)$ series may have been erroneously excluded from (14.12) which when included
may result in a cointegrating relationship among the resulting variables. In other words, $C_t$
and $Y_t$ may not be cointegrated because of an omitted variables problem. Table 14.2 gives the
Johansen (1995) cointegration test reported by EViews which is beyond the scope of this book.
The null hypotheses are that of no cointegration and that of at most one cointegrating relationship.
Neither hypothesis is rejected by the trace and maximum eigenvalue tests.


14.8  Autoregressive  Conditional  Heteroskedasticity 

Financial time-series such as foreign exchange rates, inflation rates and stock prices may exhibit
some volatility which varies over time. In the case of inflation or foreign exchange rates this
could be due to changes in the Federal Reserve’s policies. In the case of stock prices this could
be due to rumors about a certain company’s merger or takeover. This suggests that the variance
of these time-series may be heteroskedastic. Engle (1982) modeled this heteroskedasticity by
relating the conditional variance of the disturbance term at time $t$ to the size of the squared
disturbance terms in the recent past. A simple Autoregressive Conditionally Heteroskedastic
(ARCH) model is given by

$\sigma_t^2 = E(u_t^2/\zeta_t) = \gamma_0 + \gamma_1 u_{t-1}^2 + \ldots + \gamma_p u_{t-p}^2$  (14.22)

where $\zeta_t$ denotes the information set upon which the variance of $u_t$ is to be conditioned. This
typically includes all the information available prior to period $t$. In (14.22), the variance of
$u_t$ conditional on the information prior to period $t$ is an autoregressive function of order $p$ in
squared lagged values of $u_t$. This is called an ARCH($p$) process. Since (14.22) is a variance, this
means that all the $\gamma_i$’s for $i = 0, 1, \ldots, p$ have to be non-negative. Engle (1982) showed that a
simple test for homoskedasticity, i.e., $H_0$; $\gamma_1 = \gamma_2 = \ldots = \gamma_p = 0$, can be based upon an ordinary
F-test which regresses the squared OLS residuals ($e_t^2$) on their lagged values ($e_{t-1}^2, \ldots, e_{t-p}^2$)
and a constant. The F-statistic tests the joint significance of the regressors and is reported
by most regression packages. Alternatively, one can compute $T$ times the centered $R^2$ of this
regression and this is distributed as $\chi_p^2$ under the null hypothesis $H_0$. This test resembles the
usual homoskedasticity tests studied in Chapter 5 except that the squared OLS residuals are
regressed upon their lagged values rather than some explanatory variables.
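The $T$ times $R^2$ version of this test can be sketched as follows. The function name `arch_lm` and the simulated series are hypothetical, $T$ is taken to be the number of observations actually used in the auxiliary regression, and the series are treated as if they were OLS residuals.

```python
import numpy as np

def arch_lm(e, p):
    """T * (centered R^2) from regressing e_t^2 on a constant and p lags of e_t^2."""
    e2 = np.asarray(e, dtype=float) ** 2
    y = e2[p:]
    lags = [e2[p - j:len(e2) - j] for j in range(1, p + 1)]
    X = np.column_stack([np.ones(len(y))] + lags)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return len(y) * r2

# Hypothetical data: an ARCH(1) series and a homoskedastic series.
rng = np.random.default_rng(1982)
T, g0, g1 = 2000, 1.0, 0.3
eps = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = np.sqrt(g0 + g1 * u[t - 1] ** 2) * eps[t]

lm_arch = arch_lm(u, p=1)     # compare with a chi-squared(1) critical value
lm_iid = arch_lm(eps, p=1)    # no ARCH effects by construction
```

Under the null the statistic is compared with a $\chi_p^2$ critical value, for example 3.84 at the 5% level when $p = 1$.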

The simple ARCH(1) process

$\sigma_t^2 = \gamma_0 + \gamma_1 u_{t-1}^2$  (14.23)

can be generated as follows: $u_t = [\gamma_0 + \gamma_1 u_{t-1}^2]^{1/2}\epsilon_t$ where $\epsilon_t \sim$ IID$(0, 1)$. Note that the
simplifying variance of unity for $\epsilon_t$ can be achieved by rescaling the parameters $\gamma_0$ and $\gamma_1$. In this
case, the conditional mean of $u_t$ is given by

$E(u_t/\zeta_t) = [\gamma_0 + \gamma_1 u_{t-1}^2]^{1/2} E(\epsilon_t/\zeta_t) = 0$

since $u_{t-1}^2$ is known at time $t$. Similarly, the conditional variance can be easily obtained from

$E(u_t^2/\zeta_t) = [\gamma_0 + \gamma_1 u_{t-1}^2] E(\epsilon_t^2/\zeta_t) = \gamma_0 + \gamma_1 u_{t-1}^2$

since $E(\epsilon_t^2) = 1$. Also, the conditional covariances can be easily shown to be zero since

$E(u_t u_{t-s}/\zeta_t) = u_{t-s} E(u_t/\zeta_t) = 0$ for $s = 1, 2, \ldots, t$.

The unconditional mean can be obtained by taking repeated conditional expectations period
by period until we reach the initial period, see the Appendix to Chapter 2. For example, taking
the conditional expectation of $E(u_t/\zeta_t)$ based on information prior to period $t - 1$, we get

$E[E(u_t/\zeta_t)/\zeta_{t-1}] = E(0/\zeta_{t-1}) = 0$

It is clear that all prior conditional expectations of zero will be zero so that $E(u_t) = 0$. Similarly,
taking the conditional expectation of $E(u_t^2/\zeta_t)$ based on information prior to period $t - 1$, we
get

$E[E(u_t^2/\zeta_t)/\zeta_{t-1}] = \gamma_0 + \gamma_1 E[u_{t-1}^2/\zeta_{t-1}] = \gamma_0 + \gamma_1(\gamma_0 + \gamma_1 u_{t-2}^2) = \gamma_0(1 + \gamma_1) + \gamma_1^2 u_{t-2}^2$

By taking repeated conditional expectations one period at a time we finally get

$E(u_t^2) = \gamma_0(1 + \gamma_1 + \gamma_1^2 + \ldots + \gamma_1^{t-1}) + \gamma_1^t u_0^2$  (14.24)

As $t \to \infty$, the unconditional variance of $u_t$ is given by $\sigma^2 =$ var$(u_t) = \gamma_0/(1 - \gamma_1)$ for $|\gamma_1| < 1$
and $\gamma_0 > 0$. Therefore, the ARCH(1) process is unconditionally homoskedastic.
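The convergence in (14.24) can be checked by simulating many independent ARCH(1) paths and averaging $u_t^2$ across paths at each date. The parameter values, the number of paths and the starting value $u_0 = 0$ below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
g0, g1, T, n_paths = 1.0, 0.3, 200, 20000

u = np.zeros(n_paths)                 # u_0 = 0 for every path
second_moment = np.empty(T)
for t in range(T):
    sigma2 = g0 + g1 * u ** 2         # conditional variance, as in (14.23)
    u = np.sqrt(sigma2) * rng.standard_normal(n_paths)
    second_moment[t] = np.mean(u ** 2)    # estimate of E(u_t^2) for t = 1..T

def expected_u2(t):
    """Right-hand side of (14.24) with u_0 = 0, using the geometric sum."""
    return g0 * (1 - g1 ** t) / (1 - g1)

limit = g0 / (1 - g1)                 # unconditional variance as t grows
```

Here `second_moment[0]` estimates $E(u_1^2) = \gamma_0$, and later entries should settle near `limit`, which is about 1.43 for these values.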

ARCH models can be estimated using feasible GLS or maximum likelihood methods. Alternatively,
one can use a double-length regression procedure suggested by Davidson and MacKinnon
(1993) to obtain (i) one-step efficient estimates starting from OLS estimates or (ii) the maximum
likelihood estimates. Here we focus on the feasible GLS procedure suggested by Engle
(1982). For the regression model

$y = X\beta + u$    (14.25)

where $y$ is $T \times 1$ and $X$ is $T \times k$, the procedure runs as follows. First, obtain the OLS estimates
$\hat\beta_{OLS}$ and the OLS residuals $e$. Second, perform the following regression:
$e_t^2 = a_0 + a_1 e_{t-1}^2 +$ residuals. This yields a test for homoskedasticity. Third, compute
$\hat\sigma_t^2 = a_0 + a_1 e_{t-1}^2$ and regress $[(e_t^2/\hat\sigma_t^2) - 1]$ on $(1/\hat\sigma_t^2)$ and
$(e_{t-1}^2/\hat\sigma_t^2)$. Call the regression estimates $d_a$. One updates $a' = (a_0, a_1)$ by computing $\tilde a = a + d_a$.
Fourth, recompute $\hat\sigma_t^2$ using the updated $\tilde a$ from step 3, and form the set of regressors $x_{tj} r_t$ for
$j = 1, \ldots, k$, where

$r_t = \left[ \frac{1}{\hat\sigma_t^2} + 2 \left( \frac{\tilde a_1 e_t}{\hat\sigma_{t+1}^2} \right)^2 \right]^{1/2}$

Finally, regress $(e_t s_t / r_t)$, where

$s_t = \frac{1}{\hat\sigma_t^2} - \frac{\tilde a_1}{\hat\sigma_{t+1}^2} \left( \frac{e_{t+1}^2}{\hat\sigma_{t+1}^2} - 1 \right)$    (14.26)

on $x_{tj} r_t$ for $j = 1, \ldots, k$ and obtain the least squares coefficients $d_\beta$. Update the estimate of
$\beta$ by computing $\tilde\beta = \hat\beta_{OLS} + d_\beta$. This procedure can run into problems if the $\hat\sigma_t^2$ are not all
positive, see Judge et al. (1985) and Engle (1982) for details.
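
The five steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not code from the book or from Engle (1982): the function names are hypothetical, observations that lack a lagged residual or a lead variance are simply dropped, and the routine stops with an error when a fitted variance is not positive.

```python
import numpy as np


def _ols(X, y):
    """Least squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def engle_fgls_arch1(y, X):
    """One pass of the feasible GLS steps for an ARCH(1) error (hypothetical helper)."""
    T = len(y)
    # Step 1: OLS estimates and residuals.
    b_ols = _ols(X, y)
    e = y - X @ b_ols
    e2 = e ** 2
    # Step 2: regress e_t^2 on a constant and e_{t-1}^2 (t = 1, ..., T-1, 0-based).
    Z = np.column_stack([np.ones(T - 1), e2[:-1]])
    a = _ols(Z, e2[1:])
    # Step 3: fitted variances, then the updating regression for a.
    sig2 = Z @ a
    if np.any(sig2 <= 0):
        raise ValueError("fitted variances are not all positive")
    W = np.column_stack([1.0 / sig2, e2[:-1] / sig2])
    a_new = a + _ols(W, e2[1:] / sig2 - 1.0)
    # Step 4: recompute variances and build r_t; sigma^2_{t+1} is needed,
    # so only t = 1, ..., T-2 can be used.
    sig2 = Z @ a_new
    if np.any(sig2 <= 0):
        raise ValueError("updated variances are not all positive")
    s2_t, s2_next = sig2[:-1], sig2[1:]
    e_t, e2_next = e[1:-1], e2[2:]
    r = np.sqrt(1.0 / s2_t + 2.0 * (a_new[1] * e_t / s2_next) ** 2)
    # Step 5: s_t from (14.26), then regress e_t * s_t / r_t on x_tj * r_t.
    s = 1.0 / s2_t - (a_new[1] / s2_next) * (e2_next / s2_next - 1.0)
    d_beta = _ols(X[1:-1] * r[:, None], e_t * s / r)
    return b_ols + d_beta, a_new
```

Calling `engle_fgls_arch1(y, X)` with a response vector and a regressor matrix that includes a column of ones returns the updated coefficient vector and the updated pair $(a_0, a_1)$. The steps can be repeated from the updated values, but that is a design choice left to the user.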

The ARCH model has been generalized by Bollerslev (1986). The Generalized ARCH
(GARCH(p, q)) model can be written as

$\sigma_t^2 = \gamma_0 + \sum_{i=1}^{p} \gamma_i u_{t-i}^2 + \sum_{j=1}^{q} \delta_j \sigma_{t-j}^2$    (14.27)

In this case, the conditional variance of $u_t$ depends upon q of its lagged values as well as p
squared lagged values of $u_t$. The simple GARCH(1,1) model is given by

$\sigma_t^2 = \gamma_0 + \gamma_1 u_{t-1}^2 + \delta_1 \sigma_{t-1}^2$    (14.28)
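
To make the recursion in (14.28) concrete, here is a minimal sketch that filters a given disturbance series through the GARCH(1,1) variance equation. The function name, the argument names and the choice to pass the starting variance explicitly are assumptions made for illustration; packages differ in how they initialize the presample variance.

```python
def garch11_variance(u, gamma0, gamma1, delta1, sigma2_init):
    """Conditional variances from sigma2_t = gamma0 + gamma1*u_{t-1}^2 + delta1*sigma2_{t-1}.

    u           : sequence of disturbances u_0, ..., u_{n-1}
    sigma2_init : assumed value of the first conditional variance
    """
    sigma2 = [float(sigma2_init)]
    for t in range(1, len(u)):
        sigma2.append(gamma0 + gamma1 * u[t - 1] ** 2 + delta1 * sigma2[t - 1])
    return sigma2
```

For example, with the illustrative inputs `u = [1.0, 2.0, 0.0]`, `gamma0 = 0.1`, `gamma1 = 0.2`, `delta1 = 0.7` and a starting variance of 1, the second variance is 0.1 + 0.2(1) + 0.7(1) = 1.0 and the third is 0.1 + 0.2(4) + 0.7(1.0) = 1.6.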


An LM test for GARCH(p, q) turns out to be equivalent to testing ARCH(p + q). This simply
regresses the squared OLS residuals on (p + q) of their lagged values. The test statistic
is T times the uncentered $R^2$ and is asymptotically distributed as $\chi^2_{p+q}$ under the null of
homoskedasticity.
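
A sketch of this auxiliary regression is given below. It is not the book's code, and it makes two assumptions worth flagging: the regression includes a constant, and the squared residuals are demeaned before the regression so that the uncentered $R^2$ of the demeaned regression coincides with the usual centered $R^2$. The statistic uses the number of observations actually available after losing the initial lags in place of T.

```python
import numpy as np


def arch_lm_statistic(resid, lags):
    """Return (n * R^2, degrees of freedom) for an ARCH(lags) LM-type test.

    resid : OLS residuals
    lags  : number of lagged squared residuals, e.g. p + q for GARCH(p, q)
    """
    e2 = np.asarray(resid, dtype=float) ** 2
    n = len(e2) - lags
    # Dependent variable: squared residuals, demeaned (assumption, see lead-in).
    y = e2[lags:] - e2[lags:].mean()
    # Regressors: a constant and the first `lags` lags of the squared residuals.
    columns = [np.ones(n)] + [e2[lags - i:len(e2) - i] for i in range(1, lags + 1)]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    r2 = (fitted @ fitted) / (y @ y)  # uncentered R^2 of the demeaned regression
    return n * r2, lags
```

The returned statistic is compared with a $\chi^2$ critical value with the returned degrees of freedom; for one lag at the 5% level that critical value is 3.84.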

In  conclusion,  a  lot  of  basic  concepts  have  been  introduced  in  this  chapter  and  we  barely 
scratched  the  surface.  Hopefully,  this  will  motivate  the  reader  to  take  the  next  econometrics 
time  series  course. 
Table 14.3  GARCH(1,1) model

Dependent Variable: CONSUMP
Method: ML - ARCH (Marquardt) - Normal distribution
Sample: 1959 2007
Included observations: 49
Convergence achieved after 19 iterations
Presample variance: backcast (parameter = 0.7)
GARCH = C(3) + C(4)*RESID(-1)^2 + C(5)*GARCH(-1)

                     Coefficient    Std. Error    z-Statistic    Prob.
C                    -1435.888      226.3933      -6.342449      0.0000
Y                    0.986813       0.011923      82.76452       0.0000

Variance Equation
C                    118728.2       87402.35      1.358410       0.1743
RESID(-1)^2          1.068561       0.326091      3.276885       0.0010
GARCH(-1)            -0.380949      0.187656      -2.030036      0.0424

R-squared            0.993542       Mean dependent var       16749.10
Adjusted R-squared   0.992955       S.D. dependent var       5447.060
S.E. of regression   457.1938       Akaike info criterion    14.86890
Sum squared resid    9197153.       Schwarz criterion        15.06194
Log likelihood       -359.2880      Hannan-Quinn criter.     14.94214
F-statistic          1692.353       Durbin-Watson stat       0.178409
Prob(F-statistic)    0.000000

Note 

1. Granger causality was developed by Granger (1969). For another definition of causality,
see Sims (1972). See also Chamberlain (1982) for a discussion of when these two definitions are
equivalent.

Problems 

1. For the AR(1) model

$y_t = \rho y_{t-1} + \epsilon_t$   $t = 1, 2, \ldots, T$; with $|\rho| < 1$ and $\epsilon_t \sim \text{IIN}(0, \sigma_\epsilon^2)$

(a) Show that if $y_0 \sim N(0, \sigma_\epsilon^2/(1 - \rho^2))$, then $E(y_t) = 0$ for all t and $\text{var}(y_t) = \sigma_\epsilon^2/(1 - \rho^2)$ so that
the mean and variance are independent of t. Note that if $\rho = 1$ then $\text{var}(y_t)$ is $\infty$. If $|\rho| > 1$
then this expression for $\text{var}(y_t)$ is negative!

(b) Show that $\text{cov}(y_t, y_{t-s}) = \rho^s \sigma_y^2$, where $\sigma_y^2 = \text{var}(y_t)$ from part (a), which is only dependent on s,
the distance between the two time periods. Conclude from parts (a) and (b) that this AR(1) model is weakly stationary.

(c) Generate the above AR(1) series for T = 250, $\sigma_\epsilon^2 = 0.25$ and various values of $\rho = \pm 0.9, \pm 0.8,
\pm 0.5, \pm 0.3$ and $\pm 0.1$. Plot the AR(1) series and the autocorrelation function $\rho_s$ versus s.

2. For the MA(1) model

$y_t = \epsilon_t + \theta \epsilon_{t-1}$   $t = 1, 2, \ldots, T$; with $\epsilon_t \sim \text{IIN}(0, \sigma_\epsilon^2)$

(a) Show that $E(y_t) = 0$ and $\text{var}(y_t) = \sigma_\epsilon^2(1 + \theta^2)$ so that the mean and variance are independent
of t.

(b) Show that $\text{cov}(y_t, y_{t-1}) = \theta \sigma_\epsilon^2$ and $\text{cov}(y_t, y_{t-s}) = 0$ for $s > 1$ which is only dependent on s,
the distance between the two time periods. Conclude from parts (a) and (b) that this MA(1)
model is weakly stationary.

(c) Generate the above MA(1) series for T = 250, $\sigma_\epsilon^2 = 0.25$ and various values of $\theta = 0.9, 0.8,
0.5, 0.3$ and $0.1$. Plot the MA(1) series and the autocorrelation function versus s.

3.  Using  the  consumption-personal  disposable  income  data  for  the  U.S.  used  in  this  chapter: 

(a) Compute the sample autocorrelation function for personal disposable income ($Y_t$). Plot the
sample correlogram. Repeat for the first-differenced series ($\Delta Y_t$). Compute the Ljung-Box
$Q_{LB}$ statistic and test $H_0$: $\rho_s = 0$ for $s = 1, \ldots, 20$.

(b) Run the Augmented Dickey-Fuller test for the existence of a unit root in personal disposable
income ($Y_t$).

(c) Define $\tilde Y_t = \Delta Y_t$ and regress $\Delta \tilde Y_t$ on $\tilde Y_{t-1}$ and a constant and trend. Test that the first-differenced
series of personal disposable income is stationary. What do you conclude? Is $Y_t$ an I(1)
process?

(d) Replicate the regression in (14.21) and verify the Engle-Granger (1987) test for cointegration.

(e) Replicate the GARCH(1,1) model given in Table 14.3.

(f) Repeat parts (a) through (e) using logC and logY. Are there any changes in the above
results?

4. (a) Generate T = 25 observations on $x_t$ and $y_t$ as independent random walks with IIN(0,1)
disturbances. Run the regression $y_t = \alpha + \beta x_t + u_t$ and test the null hypothesis $H_0$: $\beta = 0$
using the usual t-statistic at the 1%, 5% and 10% levels. Repeat this experiment 1000 times
and report the frequency of rejections at each significance level. What do you conclude?

(b) Repeat part (a) for T = 100 and T = 500.

(c) Repeat parts (a) and (b) generating $x_t$ and $y_t$ as independent random walks with drift as
described in (14.11), using IIN(0,1) disturbances. Let $\gamma = 0.2$ for both series.

(d) Repeat parts (a) and (b) generating $x_t$ and $y_t$ as independent trend stationary series as
described in (14.10), using IIN(0,1) disturbances. Let $\alpha = 1$ and $\beta = 0.04$ for both series.

(e)  Report  the  frequency  distributions  of  the  R2  statistics  obtained  in  parts  (a)  through  (d)  for 
each  sample  size  and  method  of  generating  the  time-series.  What  do  you  conclude?  Hint: 
See  the  Monte  Carlo  experiments  in  Granger  and  Newbold  (1974),  Davidson  and  MacKinnon 
(1993)  and  Banerjee,  Dolado,  Galbraith  and  Hendry  (1993). 

5. For the Money Supply, GNP and interest rate series data for the U.S. given on the Springer web
site as MACRO.ASC, fit a VAR three equation model using:

(a) Two lags on each variable.

(b) Three lags on each variable.

(c) Compute the Likelihood Ratio test for part (a) versus part (b).

(d) For the two-equation VAR of Money Supply and interest rate with three lags on each variable,
test that the interest rate does not Granger cause the money supply.

(e) How sensitive are the tests in part (d) if we had used only two lags on each variable?
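6. For the simple Deterministic Time Trend Model

$y_t = \alpha + \beta t + u_t$   $t = 1, \ldots, T$

where $u_t \sim \text{IIN}(0, \sigma^2)$.

(a) Show that

$\begin{pmatrix} \hat\alpha_{OLS} - \alpha \\ \hat\beta_{OLS} - \beta \end{pmatrix} = (X'X)^{-1} X'u = \begin{bmatrix} T & \sum_{t=1}^{T} t \\ \sum_{t=1}^{T} t & \sum_{t=1}^{T} t^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_{t=1}^{T} u_t \\ \sum_{t=1}^{T} t u_t \end{bmatrix}$

where the t-th observation of X, the matrix of regressors, is $[1, t]$.

(b) Use the results that $\sum_{t=1}^{T} t = T(T+1)/2$ and $\sum_{t=1}^{T} t^2 = T(T+1)(2T+1)/6$ to show that
plim $(X'X/T)$ as $T \to \infty$ is not a positive definite matrix.

(c) Use the fact that

$\begin{pmatrix} \sqrt{T}(\hat\alpha_{OLS} - \alpha) \\ T\sqrt{T}(\hat\beta_{OLS} - \beta) \end{pmatrix} = A(X'X)^{-1}AA^{-1}(X'u) = (A^{-1}(X'X)A^{-1})^{-1}A^{-1}(X'u)$

where $A = \begin{pmatrix} \sqrt{T} & 0 \\ 0 & T\sqrt{T} \end{pmatrix}$ is the $2 \times 2$ nonsingular matrix, to show that plim $(A^{-1}(X'X)A^{-1})$ is the
finite positive definite matrix

$Q = \begin{pmatrix} 1 & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{3} \end{pmatrix}$   and   $A^{-1}(X'u) = \begin{pmatrix} \sum_{t=1}^{T} u_t/\sqrt{T} \\ \sum_{t=1}^{T} t u_t/T\sqrt{T} \end{pmatrix}$

(d) Show that $z_1 = \sum_{t=1}^{T} u_t/\sqrt{T}$ is $N(0, \sigma^2)$ and $z_2 = \sum_{t=1}^{T} t u_t/T\sqrt{T}$ is $N(0, \sigma^2(T+1)(2T+1)/6T^2)$
with $\text{cov}(z_1, z_2) = (T+1)\sigma^2/2T$, so that

$\begin{pmatrix} z_1 \\ z_2 \end{pmatrix} \sim N\left( 0, \sigma^2 \begin{pmatrix} 1 & \frac{T+1}{2T} \\ \frac{T+1}{2T} & \frac{(T+1)(2T+1)}{6T^2} \end{pmatrix} \right)$

Conclude that as $T \to \infty$, the asymptotic distribution of $\begin{pmatrix} z_1 \\ z_2 \end{pmatrix}$ is $N(0, \sigma^2 Q)$.

(e) Using the results in parts (c) and (d), conclude that the asymptotic distribution of
$\begin{pmatrix} \sqrt{T}(\hat\alpha_{OLS} - \alpha) \\ T\sqrt{T}(\hat\beta_{OLS} - \beta) \end{pmatrix}$ is $N(0, \sigma^2 Q^{-1})$. Since $\hat\beta_{OLS}$ has the factor $T\sqrt{T}$ rather than the usual
$\sqrt{T}$, it is said to be superconsistent. This means that not only does $(\hat\beta_{OLS} - \beta)$ converge to
zero in probability limits, but so does $T(\hat\beta_{OLS} - \beta)$. Note that the normality assumption
is not needed for this result. Using the central limit theorem, all that is needed is that $u_t$
is white noise with finite fourth moments, see Sims, Stock and Watson (1990) or Hamilton
(1994).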


7. Test of Hypothesis with a Deterministic Time Trend Model. This is based on Hamilton (1994). In
problem 6, we showed that $\hat\alpha_{OLS}$ and $\hat\beta_{OLS}$ converged at different rates, $\sqrt{T}$ and $T\sqrt{T}$ respectively.
Despite this fact, the usual least squares t and F-statistics are asymptotically valid even when the
$u_t$'s are not Normally distributed.

(a) Show that $s^2 = \sum_{t=1}^{T} (y_t - \hat\alpha_{OLS} - \hat\beta_{OLS} t)^2/(T - 2)$ has plim $s^2 = \sigma^2$.

(b) In order to test $H_0$: $\alpha = \alpha_0$, the usual least squares package computes

$t_\alpha = (\hat\alpha_{OLS} - \alpha_0)/[s^2 (1, 0)(X'X)^{-1}(1, 0)']^{1/2}$

where $(X'X)$ is given in problem 6. Multiply the numerator and denominator by $\sqrt{T}$ and
use the results of part (c) of problem 6 to show that this t-statistic has the same asymptotic
distribution as $t^*_\alpha = \sqrt{T}(\hat\alpha_{OLS} - \alpha_0)/(\sigma\sqrt{q^{11}})$ where $q^{11}$ is the (1,1) element of $Q^{-1}$ defined in
problem 6. $t^*_\alpha$ has an asymptotic $N(0, 1)$ distribution using the results of part (e) in problem
6.

(c) Similarly, to test $H_0$: $\beta = \beta_0$, the usual least squares package computes

$t_\beta = (\hat\beta_{OLS} - \beta_0)/[s^2 (0, 1)(X'X)^{-1}(0, 1)']^{1/2}$.

Multiply the numerator and denominator by $T\sqrt{T}$ and use the results of part (c) of problem
6 to show that this t-statistic has the same asymptotic distribution as $t^*_\beta = T\sqrt{T}(\hat\beta_{OLS} - \beta_0)/(\sigma\sqrt{q^{22}})$
where $q^{22}$ is the (2,2) element of $Q^{-1}$ defined in problem 6. $t^*_\beta$ has an asymptotic
$N(0, 1)$ distribution using the results of part (e) in problem 6.

8. A Random Walk Model. This is based on Fuller (1976) and Hamilton (1994). Consider the following
random walk model

$y_t = y_{t-1} + u_t$   $t = 0, 1, \ldots, T$ where $u_t \sim \text{IIN}(0, \sigma^2)$ and $y_0 = 0$.

(a) Show that $y_t$ can be written as $y_t = u_1 + u_2 + \cdots + u_t$ with $E(y_t) = 0$ and $\text{var}(y_t) = t\sigma^2$ so
that $y_t \sim N(0, t\sigma^2)$.

(b) Square the random walk equation $y_t^2 = (y_{t-1} + u_t)^2$ and solve for $y_{t-1} u_t$. Sum this over
$t = 1, 2, \ldots, T$ and show that

$\sum_{t=1}^{T} y_{t-1} u_t = (y_T^2/2) - \sum_{t=1}^{T} u_t^2/2$

Divide by $T\sigma^2$ and show that $\sum_{t=1}^{T} y_{t-1} u_t/T\sigma^2$ is asymptotically distributed as $(\chi_1^2 - 1)/2$.
Hint: Use the fact that $y_T \sim N(0, T\sigma^2)$.

(c) Using the fact that $y_{t-1} \sim N(0, (t-1)\sigma^2)$ show that $E\left( \sum_{t=1}^{T} y_{t-1}^2 \right) = \sigma^2 T(T-1)/2$. Hint:
Use the expression for $\sum_{t=1}^{T} t$ in problem 6.

(d) Suppose we had estimated an AR(1) model rather than a random walk, i.e., $y_t = \rho y_{t-1} + u_t$
when the true $\rho = 1$. The OLS estimate is

$\hat\rho = \sum_{t=1}^{T} y_{t-1} y_t / \sum_{t=1}^{T} y_{t-1}^2 = \rho + \sum_{t=1}^{T} y_{t-1} u_t / \sum_{t=1}^{T} y_{t-1}^2$

Show that

plim $T(\hat\rho - \rho)$ = plim $\frac{\sum_{t=1}^{T} y_{t-1} u_t / T\sigma^2}{\sum_{t=1}^{T} y_{t-1}^2 / T^2\sigma^2} = 0$

Note that the numerator was considered in part (b), while the denominator was considered in part
(c). One can see that the asymptotic distribution of $\hat\rho$ when $\rho = 1$ is a ratio of a $(\chi_1^2 - 1)/2$ random
variable to a non-standard distribution in the denominator which is beyond the scope of this book,
see Hamilton (1994) or Fuller (1976) for further details. The object of this exercise is to show that
if $\rho = 1$, $\sqrt{T}(\hat\rho - \rho)$ is no longer normal as in the standard stationary least squares regression with
$|\rho| < 1$. Also, to show that for the nonstationary (random walk) model, $\hat\rho$ converges at a faster
rate (T) than for the stationary case ($\sqrt{T}$). From part (c) it is clear that one has to divide the
denominator of $\hat\rho$ by $T^2$ rather than T to get a convergent distribution.
9. Consider the cointegration example given in (14.13) and (14.14).

(a) Verify equations (14.15)-(14.20).

(b) Show that the OLS estimator of $\beta$ obtained by regressing $C_t$ on $Y_t$ is superconsistent, i.e.,
show that plim $T(\hat\beta_{OLS} - \beta) \to 0$ as $T \to \infty$.
Appendix


$\Phi(1.65) = \Pr[z \le 1.65] = 0.9505$


Table  A  Area  under  the  Standard  Normal  Distribution 


z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0    0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1    0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2    0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3    0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4    0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5    0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6    0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7    0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8    0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9    0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0    0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1    0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2    0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3    0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4    0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.5    0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
1.6    0.9452  0.9463  0.9474  0.9484  0.9495  0.9505  0.9515  0.9525  0.9535  0.9545
1.7    0.9554  0.9564  0.9573  0.9582  0.9591  0.9599  0.9608  0.9616  0.9625  0.9633
1.8    0.9641  0.9649  0.9656  0.9664  0.9671  0.9678  0.9686  0.9693  0.9699  0.9706
1.9    0.9713  0.9719  0.9726  0.9732  0.9738  0.9744  0.9750  0.9756  0.9761  0.9767
2.0    0.9772  0.9778  0.9783  0.9788  0.9793  0.9798  0.9803  0.9808  0.9812  0.9817
2.1    0.9821  0.9826  0.9830  0.9834  0.9838  0.9842  0.9846  0.9850  0.9854  0.9857
2.2    0.9861  0.9864  0.9868  0.9871  0.9875  0.9878  0.9881  0.9884  0.9887  0.9890
2.3    0.9893  0.9896  0.9898  0.9901  0.9904  0.9906  0.9909  0.9911  0.9913  0.9916
2.4    0.9918  0.9920  0.9922  0.9925  0.9927  0.9929  0.9931  0.9932  0.9934  0.9936
2.5    0.9938  0.9940  0.9941  0.9943  0.9945  0.9946  0.9948  0.9949  0.9951  0.9952
2.6    0.9953  0.9955  0.9956  0.9957  0.9959  0.9960  0.9961  0.9962  0.9963  0.9964
2.7    0.9965  0.9966  0.9967  0.9968  0.9969  0.9970  0.9971  0.9972  0.9973  0.9974
2.8    0.9974  0.9975  0.9976  0.9977  0.9977  0.9978  0.9979  0.9979  0.9980  0.9981
2.9    0.9981  0.9982  0.9982  0.9983  0.9984  0.9984  0.9985  0.9985  0.9986  0.9986
3.0    0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990

Source:  The  SAS®  function  PROBNORM  was  used  to  generate  this  table. 
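
Entries of Table A can be reproduced without SAS. The sketch below is an illustration, not the routine used to build the table: it relies on the identity $\Phi(z) = \frac{1}{2}[1 + \text{erf}(z/\sqrt{2})]$ and on the `math.erf` function in the Python standard library. The function name is hypothetical.

```python
import math


def standard_normal_cdf(z):
    """Area under the standard normal density to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Example: the entry in row 1.6, column 0.05 of Table A.
area = round(standard_normal_cdf(1.65), 4)
```

Rounded to four decimals, `standard_normal_cdf(1.65)` agrees with the tabulated value 0.9505.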
$\Pr[t_8 > t_\alpha = 2.306] = 0.025$


Table  B  Right-Tail  Critical  Values  for  the  t-Distribution 


DF     α=0.1     α=0.05    α=0.025   α=0.01    α=0.005
1      3.0777    6.3138    12.7062   31.8205   63.6567
2      1.8856    2.9200    4.3027    6.9646    9.9248
3      1.6377    2.3534    3.1824    4.5407    5.8409
4      1.5332    2.1318    2.7764    3.7469    4.6041
5      1.4759    2.0150    2.5706    3.3649    4.0321
6      1.4398    1.9432    2.4469    3.1427    3.7074
7      1.4149    1.8946    2.3646    2.9980    3.4995
8      1.3968    1.8595    2.3060    2.8965    3.3554
9      1.3830    1.8331    2.2622    2.8214    3.2498
10     1.3722    1.8125    2.2281    2.7638    3.1693
11     1.3634    1.7959    2.2010    2.7181    3.1058
12     1.3562    1.7823    2.1788    2.6810    3.0545
13     1.3502    1.7709    2.1604    2.6503    3.0123
14     1.3450    1.7613    2.1448    2.6245    2.9768
15     1.3406    1.7531    2.1314    2.6025    2.9467
16     1.3368    1.7459    2.1199    2.5835    2.9208
17     1.3334    1.7396    2.1098    2.5669    2.8982
18     1.3304    1.7341    2.1009    2.5524    2.8784
19     1.3277    1.7291    2.0930    2.5395    2.8609
20     1.3253    1.7247    2.0860    2.5280    2.8453
21     1.3232    1.7207    2.0796    2.5176    2.8314
22     1.3212    1.7171    2.0739    2.5083    2.8188
23     1.3195    1.7139    2.0687    2.4999    2.8073
24     1.3178    1.7109    2.0639    2.4922    2.7969
25     1.3163    1.7081    2.0595    2.4851    2.7874
26     1.3150    1.7056    2.0555    2.4786    2.7787
27     1.3137    1.7033    2.0518    2.4727    2.7707
28     1.3125    1.7011    2.0484    2.4671    2.7633
29     1.3114    1.6991    2.0452    2.4620    2.7564
30     1.3104    1.6973    2.0423    2.4573    2.7500
31     1.3095    1.6955    2.0395    2.4528    2.7440
32     1.3086    1.6939    2.0369    2.4487    2.7385
33     1.3077    1.6924    2.0345    2.4448    2.7333
34     1.3070    1.6909    2.0322    2.4411    2.7284
35     1.3062    1.6896    2.0301    2.4377    2.7238
36     1.3055    1.6883    2.0281    2.4345    2.7195
37     1.3049    1.6871    2.0262    2.4314    2.7154
38     1.3042    1.6860    2.0244    2.4286    2.7116
39     1.3036    1.6849    2.0227    2.4258    2.7079
40     1.3031    1.6839    2.0211    2.4233    2.7045

Source: The SAS® function TINV was used to generate this table.
Source: The SAS® function FINV was used to generate this table. $\nu_1$ = numerator degrees of freedom, $\nu_2$ = denominator degrees of freedom


Table  D  Right-Tail  Critical  Values  for  the  F-Distribution:  Upper  1%  Points 


